Over the past few weeks I made some improvements to the Twitter mining code (see the last post, on building a dataset of media for CV training) and let it “run in the wild”. In that time it harvested ~2.5 million tweets and ~1 million images. These come from ~5000 users, predominantly furries (seemingly biased towards fursuiters), who were identified by the program. Only publicly accessible information was collected; hashtags and bio keywords served as the primary filter, and the user list was manually adjusted in places to keep the collected dataset pure.
This is already a fairly substantial dataset, large enough for some quite complex analysis. Using, for example, the location of a tweet and the hashtags it contains, one could plot what is trending regionally around the globe.
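As a rough illustration of the regional trending idea, here is a minimal sketch that counts hashtags per region. The records, the region codes and the `trending_by_region` helper are all made up for the example; real tweets would first need their location field resolved to a region.

```python
from collections import Counter, defaultdict

# Hypothetical records: (region, [hashtags]) pairs, not real data.
tweets = [
    ("US", ["furry", "FurFest"]),
    ("US", ["furry"]),
    ("JP", ["fursuit", "furry"]),
    ("JP", ["fursuit"]),
    (None, ["furry"]),  # no location set, so it is skipped
]

def trending_by_region(records, top_n=3):
    """Return {region: [(hashtag, count), ...]} with the top_n tags per region."""
    counts = defaultdict(Counter)
    for region, hashtags in records:
        if region is None:
            continue
        # Lower-case so that #Furry and #furry count as one tag.
        counts[region].update(tag.lower() for tag in hashtags)
    return {region: c.most_common(top_n) for region, c in counts.items()}

trends = trending_by_region(tweets)
```

Tweets without a location simply drop out, which is why sparse use of the location feature translates directly into sparse regional statistics.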
One could take this a step further by performing sentiment analysis on the text of each tweet, to gauge the fandom’s opinion on the hashtag (topic). With some insight into the context of a hashtag, for example that it is associated with a convention, deeper insights into that convention could be harvested: overall “happiness” of tweets mentioning it, “happiness” of attendees during the convention, “happiness” of foreign attendees, estimates of attendees from regions… this could be a fun and useful thing to do!
textblob (Python package) was used to perform the rudimentary linguistic analysis. Sentiment for each tweet comes from textblob.blob.sentiment.polarity. Alongside it, some additional textblob functions were used: language, to determine the sentence language (deal with those damn bi-lingualists!); word.lemmatize, to reduce each word to its base form for later frequency analysis; and sentence.tags, for Part of Speech (POS) tagging, which allows better text analysis (dealing with self-identifying terms in Twitter bios is a good example, but so is harvesting the “negative/positive words” used to describe conventions). Results from this process are stored in additional tables in the database. As this runs as a “worker” service, it processes tweets with a lag; currently ~20% of the 2.5 million tweets have been text processed.
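The “worker” pattern described above can be sketched roughly as follows. This is not the actual worker: the table names (`tweets`, `tweet_sentiment`), the `process_batch` helper and the toy scorer are assumptions chosen so the example runs with only the standard library.

```python
import sqlite3

def process_batch(conn, score, batch_size=100):
    """Score up to batch_size tweets that have no sentiment row yet.

    `score` is any callable mapping text to a polarity in [-1, 1].
    Returns the number of tweets processed in this batch.
    """
    rows = conn.execute(
        """
        SELECT t.id, t.text FROM tweets AS t
        LEFT JOIN tweet_sentiment AS s ON s.tweet_id = t.id
        WHERE s.tweet_id IS NULL
        ORDER BY t.id
        LIMIT ?
        """,
        (batch_size,),
    ).fetchall()
    conn.executemany(
        "INSERT INTO tweet_sentiment (tweet_id, polarity) VALUES (?, ?)",
        [(tweet_id, score(text)) for tweet_id, text in rows],
    )
    conn.commit()
    return len(rows)

def toy_score(text):
    """Stand-in scorer (hypothetical word lists), used instead of textblob."""
    positive = {"love", "great", "fun"}
    negative = {"hate", "awful", "boring"}
    words = text.lower().split()
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, hits / max(len(words), 1)))

# Hypothetical schema and sample rows, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE tweet_sentiment (tweet_id INTEGER PRIMARY KEY, polarity REAL)")
conn.executemany(
    "INSERT INTO tweets (text) VALUES (?)",
    [("love this con",), ("awful queue",), ("see you there",)],
)

first = process_batch(conn, toy_score, batch_size=2)   # two tweets scored
second = process_batch(conn, toy_score, batch_size=2)  # the remaining one
third = process_batch(conn, toy_score, batch_size=2)   # nothing left to do
```

With textblob installed, the scorer passed in would be something like `lambda text: TextBlob(text).sentiment.polarity`. Because each batch only selects tweets without a sentiment row, the worker can be stopped and restarted without reprocessing anything.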
Current stats:
492,953 tweets have been processed, containing 7,307,107 words, of which 398,229 are unique (across 5 primary languages: English, Japanese, French, Spanish, German).
Onto looking at hashtags…
Hashtags, while scraped and added to the tweet db, were not an initial consideration, so no dedicated tables were created to expedite their analysis. Time to build another worker, and some more tables. This worker processes data that has already been collected: hashtags exist as a JSON string in the tweet db, so the string needs to be read, split and added to a secondary db, with some extra information, to facilitate quicker queries later. The worker was launched a day ago and has processed ~200,000 tweets containing 85,000 distinct hashtags. That is enough for some kind of insight, although it is likely to suffer from sample bias due to the initial method of building the tweet db (populating by scraping tags: #furries, #fursuit, #fursuitfriday), which is evident from the top 5 results.
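A minimal sketch of that read, split and store step, using sqlite3 and json from the standard library. The schema, the column names and the assumption that the JSON string is a plain list of tag strings are all mine; the real tweet db may well store hashtags differently.

```python
import json
import sqlite3

def extract_hashtags(conn):
    """Copy hashtags out of each tweet's JSON string into their own table."""
    rows = conn.execute(
        "SELECT id, created_at, hashtags_json FROM tweets"
    ).fetchall()
    for tweet_id, created_at, raw in rows:
        tags = json.loads(raw) if raw else []
        # INSERT OR IGNORE plus the primary key makes a re-run harmless.
        conn.executemany(
            "INSERT OR IGNORE INTO tweet_hashtags (tweet_id, tag, created_at) "
            "VALUES (?, ?, ?)",
            [(tweet_id, tag, created_at) for tag in tags],
        )
    conn.commit()

# Hypothetical schema: one row per (tweet, hashtag), indexed by tag.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT, hashtags_json TEXT)"
)
conn.execute(
    """CREATE TABLE tweet_hashtags (
           tweet_id INTEGER, tag TEXT, created_at TEXT,
           PRIMARY KEY (tweet_id, tag))"""
)
conn.execute("CREATE INDEX idx_tag ON tweet_hashtags (tag)")
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", [
    (1, "2016-03-04", '["furry", "fursuit"]'),
    (2, "2016-03-05", '["furry"]'),
    (3, "2016-03-05", "[]"),
])

extract_hashtags(conn)
top = conn.execute(
    "SELECT tag, COUNT(*) AS n FROM tweet_hashtags GROUP BY tag ORDER BY n DESC, tag"
).fetchall()
```

The copied `created_at` column stands in for the “extra information”: with it, time-based queries can run against the hashtag table alone, without touching the much larger tweet table.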
The top 14 hashtags (so far…) are:
hashtag | count
furry | 5366
fursuit | 4575
Furries | 2939
furryfandom | 2693
FursuitFriday | 1936
ayokeriau | 1855
TMITuesday | 1620
helastship | 1416
wolves | 1354
Quote | 1100
wolf | 1014
writers | 1004
FurriesAreAwesome | 952
Zootopia | 924
Convention-related tags start appearing at 17th position in the frequency list, with FurFest, MWFF and AnthroCon being the first 3 to appear.
I expect these hashtag frequencies to change drastically as tweet processing nears completion. Currently, because the location feature is not used on all tweets and because of sample bias, statistics are poor for large portions of the globe, with massive areas lacking any data; I expect this to improve with time. Time statistics are also quite sporadic, with hashtags clustered in particular months, a consequence of incomplete dataset processing.
Average sentiment has not been reported for these hashtags for a few reasons, mainly because I feel it would be too unrepresentative: the hashtag processing worker will soon overtake the tweet text processing worker, which is much slower and which provides the sentiment of the tweet text associated with each hashtag.
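To make the coverage problem concrete, here is a small sketch that reports, per hashtag, how many tagged tweets actually have a sentiment score next to the average itself. The table names and sample values are invented for illustration only.

```python
import sqlite3

# Hypothetical tables and values: tweet 3 is tagged but not yet scored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet_hashtags (tweet_id INTEGER, tag TEXT);
CREATE TABLE tweet_sentiment (tweet_id INTEGER PRIMARY KEY, polarity REAL);
INSERT INTO tweet_hashtags VALUES (1, 'furry'), (2, 'furry'), (3, 'furry'), (3, 'wolf');
INSERT INTO tweet_sentiment VALUES (1, 0.5), (2, -0.1);
""")

rows = conn.execute("""
    SELECT h.tag,
           COUNT(*)          AS tagged,        -- tweets carrying the hashtag
           COUNT(s.tweet_id) AS scored,        -- of those, tweets with a polarity
           AVG(s.polarity)   AS mean_polarity  -- AVG skips the unscored (NULL) rows
    FROM tweet_hashtags AS h
    LEFT JOIN tweet_sentiment AS s ON s.tweet_id = h.tweet_id
    GROUP BY h.tag
    ORDER BY h.tag
""").fetchall()

report = {tag: (tagged, scored, mean) for tag, tagged, scored, mean in rows}
```

Reporting `scored` next to `tagged` shows at a glance how little of a hashtag's sample the average rests on; a tag with no scored tweets comes back with a mean of `None` rather than a misleading number.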
Visualisations are being prepared, and will be posted when there are more meaningful statistics for sentiment and location.
ToDo: dashboard for readers to navigate data