About Bristrends
Introduction
As a Bristol student, I've always enjoyed reading Bristruths, both for the amusing content and for the interesting opinions and perspectives. Put simply, Bristruths is the best single source of information about Bristol students (UWE doesn't count, sorry) and their culture, yet it is a poorly-tamed source of data. There is no good way to see which Bristruths are the most popular or which topics are current without manually looking through all the posts. This, combined with my interest in data science, motivated me to obtain all the publicly available Bristruths and conduct these analyses.
The process of obtaining and analysing Bristruths was not without its difficulties, and in the following sections I'll explain each step, walking through the choice of approach and technology.
Getting Bristruths
Bristruths are posts on a Facebook page whose contents are easily readable and accessible to the public, but not readily available to download as a table (i.e. as you would be able to via an API) unless you are a page admin. Having experience with coding in Python, I decided to solve this issue by using the `facebook-scraper` Python package to obtain all the Bristruths, which I saved to a local MySQL database.
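A minimal sketch of this scraping step. The `facebook_scraper.get_posts` generator and the post dict keys used here (`post_id`, `text`, `time`, `reactions`, and so on) reflect recent versions of the `facebook-scraper` package, but the exact arguments and keys are assumptions and should be checked against the installed version:

```python
# Flatten a facebook-scraper post dict into a flat row for storage.
# The dict keys are assumptions based on the facebook-scraper package's
# typical output, not the article's actual code.

def post_to_row(post):
    """Pick out the fields described in the article from a scraped post."""
    reactions = post.get("reactions") or {}
    return {
        "post_id": post.get("post_id"),
        "text": post.get("text") or "",
        "time": post.get("time"),
        "comments": post.get("comments") or 0,
        "shares": post.get("shares") or 0,
        # One column per reaction type, defaulting to zero.
        **{kind: reactions.get(kind, 0)
           for kind in ("like", "love", "haha", "wow", "sad", "angry")},
    }

# Usage against the live page (requires facebook-scraper; not run here):
#   from facebook_scraper import get_posts
#   for post in get_posts("bristruths", pages=5, extra_info=True):
#       save(post_to_row(post))
```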
But what information did I actually obtain? For each post, I fetched: the text, Facebook post ID, the number of reacts (such as how many "haha" or "sad" reacts a post garnered), the number of shares, the number of comments and the publishing time. Note that I did not save any comments, primarily because they contain lots of names and sometimes even quasi-private correspondences (especially as lots of students tag each other in posts).
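The stored fields might look like the table below. This is a sketch only: the article used MySQL, so the stdlib `sqlite3` module stands in here, and the column names are my own guesses rather than the project's real schema:

```python
import sqlite3

# In-memory sqlite3 database standing in for the project's MySQL database;
# columns mirror the fields listed above, with one column per react type.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bristruths (
        post_id   TEXT PRIMARY KEY,
        text      TEXT,
        published TEXT,
        comments  INTEGER,
        shares    INTEGER,
        haha      INTEGER,
        sad       INTEGER
    )
""")
conn.execute(
    "INSERT INTO bristruths VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("123", "#bristruth9999 example post", "2021-03-01", 4, 1, 12, 0),
)
total_haha, = conn.execute("SELECT SUM(haha) FROM bristruths").fetchone()
```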
Unfortunately, because Facebook does not show every published post, I could only obtain a fraction of all 30k+ posts - in the end, I obtained 2027 posts.
Analysing Bristruths
Most of the initial analysis was simply conducted in Python with the help of the forever-useful Pandas library for data manipulation. This let me sort the posts by reacts, find the average number of total reacts, and also see posts from any chosen time frame. Browsing through posts from different years gave me the inspiration to sort them into categories, as it was clear that there were particular types of posts - the most obvious being ranking posts, react voting posts and meme posts (which I did not include in the list of categories as there were simply too many formats for me to identify them all accurately).
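The sorting, averaging and time-window steps can be sketched with Pandas as below, on made-up data; the column names are illustrative, not the project's actual schema:

```python
import pandas as pd

# Toy stand-in for the scraped table of posts.
posts = pd.DataFrame({
    "text": ["rank the halls", "react to vote", "lecture woes"],
    "total_reacts": [350, 120, 40],
    "published": pd.to_datetime(["2019-05-01", "2020-11-12", "2021-02-03"]),
})

top = posts.sort_values("total_reacts", ascending=False)  # most-reacted first
mean_reacts = posts["total_reacts"].mean()                # average reacts
in_2020 = posts[(posts["published"] >= "2020-01-01") &    # a chosen time frame
                (posts["published"] < "2021-01-01")]
```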
Initially, I attempted to use different algorithms (such as Latent Dirichlet Allocation) to automatically identify categories based on the common co-occurrence of different words. Unfortunately, these automatically-generated categories often did not make sense - they grouped posts on words that had little influence on what "type" of post it was from a human perspective. Consequently, I started to manually determine categories based on what I had personally seen in Bristruths.
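The topic-modelling attempt might have looked something like this. The article does not say which library was used, so this sketch assumes scikit-learn's LDA implementation on a toy corpus:

```python
# Latent Dirichlet Allocation on a tiny illustrative corpus using
# scikit-learn; this is one plausible setup, not the article's actual code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "labour conservative election vote",
    "election vote labour policy",
    "library revision exam stress",
    "exam revision library coffee",
]
X = CountVectorizer().fit_transform(docs)   # word-count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # one topic distribution per doc
```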
To determine whether a post belonged to a category (such as politics), I devised a list of keywords for each category and then checked the post's text for any matches (for instance, the politics keyword list included "Labour", "Conservative" and "Election"); if a keyword matched, the post was put in that category. This method was used for all categories excluding the "react voting" category, where I checked the text for tell-tale signs of react voting (such as the presence of the word "react" or multiple react emojis).
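The keyword-matching step can be sketched as below; the keyword lists are illustrative, drawing on the examples in the text rather than the project's real lists:

```python
# Assign categories by simple keyword matching. The keyword lists here
# are illustrative examples, not the project's actual lists.
CATEGORY_KEYWORDS = {
    "politics": ["labour", "conservative", "election"],
    "ranking": ["rank", "top 5", "tier list"],
}

def categorise(text):
    """Return every category whose keyword list matches the post text."""
    lowered = text.lower()
    return [cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(word in lowered for word in words)]
```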
Unfortunately, this keyword-match method of identifying categories is not perfect as one word can be used in multiple contexts (such as "labour") and hence there were some false matches that I identified and removed manually.
When manipulating text, I often made use of Regular Expressions via the `re` package to see if the text matched a certain input pattern, and also the `emoji` package for converting emojis to text, thus making it easier to process.
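As one concrete example, the react-voting check described above can be done with `re` alone. This sketch uses a rough Unicode-range regex to count emoji characters as a stdlib stand-in for the `emoji` package's conversion step:

```python
import re

# Heuristic react-vote detector: match the word "react", or count emoji
# characters via a rough Unicode-range pattern (an assumption standing in
# for the emoji-to-text conversion the article describes).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def looks_like_react_vote(text):
    if re.search(r"\breact\b", text, flags=re.IGNORECASE):
        return True
    # Two or more emojis is another tell-tale sign of a react-voting post.
    return len(EMOJI_RE.findall(text)) >= 2
```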
Presenting the Data
Once the data was processed, all that was left was to present it in a readable and accessible format - hence I plotted various graphs to illustrate the data. The graphs are all made in `Bokeh`, another Python package which has the advantage of generating interactive graphs that can be embedded in webpages without resorting to complicated solutions.
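A minimal example of that embedding workflow with Bokeh is shown below; the data is made up for illustration, and `components()` is Bokeh's standard way of producing embeddable fragments:

```python
# Build a small Bokeh bar chart and produce the <script>/<div> pair that
# can be pasted into a static webpage. The data here is illustrative.
from bokeh.plotting import figure
from bokeh.embed import components

years = [2018, 2019, 2020]
mean_reacts = [35, 42, 51]

p = figure(title="Mean reacts per post by year",
           x_axis_label="Year", y_axis_label="Mean reacts")
p.vbar(x=years, top=mean_reacts, width=0.6)

# components() returns the two HTML fragments to embed in a page.
script, div = components(p)
```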
This website is hosted on GitHub Pages, which was the perfect solution for me: it hosts static websites directly from the GitHub repository I was already using to manage the project files.
Questions?
If you have any questions, suggestions, concerns or general inquiries, please send an email to info@bristrends.co.uk