Introduction
At EF23 I wanted to see a talk given by XX on an analysis of e621’s posts and tags, but I didn’t make it (hungover, I think). Berf, oh well. I read a little about the talk in the EF newspaper, then forgot about it until I found the newspaper again while clearing out my office. That motivated me to take a little look at scraping e621 and at what I can learn from the tags. Bonus: I’m learning to create interactive data analysis dashboards with Python, so this could be a neat little project!
Scraping e621
I need to get the post data associated with each image posted to e621. Thankfully, e621 exposes an API to facilitate automated (bulk) harvesting of their data. I wrote a scraper to acquire the post data while respecting server load concerns (I limited the scraper to 200 posts per 10 s) and left it running… Over the course of 2 months, 500,000 posts were ingested and their metadata stored in a MySQL database; image files were not downloaded. I accepted a failure rate of approximately 2% (encoding errors for weird characters, post descriptions overflowing the database field length of 3000 characters, etc.), on the assumption that the remaining 98% of posts provide a sufficiently representative sample of the whole.
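A minimal sketch of how a rate limit like that could be enforced, assuming one request returns a batch of up to 200 posts and batches are started at least 10 s apart. The names `throttled_batches` and `fetch_batch` are hypothetical; `fetch_batch` stands in for whatever function actually calls the API and is deliberately left to the caller.

```python
import time
from typing import Callable, Iterable, Iterator, List


def throttled_batches(
    fetch_batch: Callable[[int], List[dict]],  # hypothetical: returns one page of posts
    pages: Iterable[int],
    min_interval: float = 10.0,                # seconds between the start of two fetches
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[dict]]:
    """Yield one batch per page, never starting two fetches
    less than `min_interval` seconds apart."""
    last_start = None
    for page in pages:
        if last_start is not None:
            # Only wait for the part of the interval that has not already elapsed.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch_batch(page)
```

Passing `clock` and `sleep` in as parameters is just a convenience so the throttling can be checked without actually waiting; in normal use the defaults are fine.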
A graph of the number of posts, arranged by year and month, is presented below:
Interesting. This tells us a couple of things at first glance: there are some busy periods for posting (spikes), and there’s a general upward trend in the number of posts per month (the fandom becomes more popular / more art is produced / more art is posted to the site as it becomes more popular).
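The counting behind a posts-per-month plot can be sketched as follows. This assumes each post carries an ISO 8601 creation timestamp; the function name `posts_per_month` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def posts_per_month(timestamps):
    """Count posts per (year, month), sorted chronologically."""
    counts = Counter()
    for ts in timestamps:
        created = datetime.fromisoformat(ts)  # assumes ISO 8601 strings
        counts[(created.year, created.month)] += 1
    return sorted(counts.items())


# Made-up sample data, not real e621 posts.
sample = ["2016-01-03T10:00:00", "2016-01-20T08:30:00", "2016-02-01T00:00:00"]
monthly = posts_per_month(sample)
```

The resulting list of `((year, month), count)` pairs can be handed straight to whatever plotting library is in use.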
Now we have the posts ingested, time to look at some other stuff. What are the most popular tags used on e621? I’ll query the top 15, collect the rest of the data as “other”, and plot the result as a pie chart. The top 15 tags and “other” are presented with their percentage occurrence in the legend.
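A rough sketch of the "top N plus other" grouping, written in plain Python rather than as a database query. The function name `top_n_with_other` is hypothetical, and the percentages here are shares of all tag occurrences, which is one possible reading of "percentage occurrence" (the alternative would be the share of posts carrying each tag).

```python
from collections import Counter


def top_n_with_other(posts_tags, n=15):
    """Return (tag, count, percent) rows for the n most common tags,
    plus one 'other' row that lumps together everything else."""
    counts = Counter(tag for tags in posts_tags for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return []
    top = counts.most_common(n)
    rows = [(tag, c, 100.0 * c / total) for tag, c in top]
    other = total - sum(c for _, c in top)
    if other > 0:
        rows.append(("other", other, 100.0 * other / total))
    return rows


# Made-up sample data, with n=2 to keep it readable.
sample_posts = [
    ["anthro", "male"],
    ["anthro", "female"],
    ["anthro", "male", "solo"],
]
rows = top_n_with_other(sample_posts, n=2)
```

Each row maps directly onto one pie slice and one legend entry.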
Cool. Let’s look at how the top 5 of these tags change with time. “Anthro” follows the shape of the posts vs. time curve, and is basically a defining tag present on each post. The others show more muted variation, loosely following the shape of the posts vs. time plot, with some subtle changes; for example, male and female switch in relative popularity at a couple of points. This could warrant further investigation.
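A per-tag time series like this can be built with a small grouping step, sketched below. It assumes each post is available as a `(created_at, tags)` pair with an ISO 8601 timestamp; the function name `tag_counts_by_month` and the sample posts are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def tag_counts_by_month(posts, tags_of_interest):
    """Map each tag of interest to {(year, month): number of posts with that tag}."""
    wanted = set(tags_of_interest)
    series = {tag: defaultdict(int) for tag in wanted}
    for created_at, tags in posts:
        created = datetime.fromisoformat(created_at)
        month = (created.year, created.month)
        # The set intersection counts a tag at most once per post.
        for tag in wanted.intersection(tags):
            series[tag][month] += 1
    return {tag: dict(sorted(months.items())) for tag, months in series.items()}


# Made-up sample data, not real e621 posts.
sample_posts = [
    ("2016-01-03T10:00:00", ["anthro", "male"]),
    ("2016-01-20T08:30:00", ["anthro", "female"]),
    ("2016-02-01T00:00:00", ["anthro", "male"]),
]
series = tag_counts_by_month(sample_posts, ["anthro", "male", "female"])
```

Comparing two of these series month by month (say, male against female) is one way to locate the points where their relative popularity flips.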
Hmmm… Let’s take a look at the species present in the e621 posts. Thankfully e621’s metadata differentiates between species, copyright, character, and descriptive tags.
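Because the tags come pre-sorted into categories, pulling out just the species is a simple filter. The sketch below assumes each post has been stored as a dict mapping a category name to a list of tags; that layout, the key names, and the function name `count_tags_in_category` are assumptions for illustration, not necessarily how the API or the database represents them.

```python
from collections import Counter


def count_tags_in_category(posts, category):
    """Count how many posts carry each tag within a single category."""
    counts = Counter()
    for post in posts:
        # set() guards against a tag being listed twice on the same post.
        counts.update(set(post.get(category, [])))
    return counts.most_common()


# Made-up sample data using an assumed dict layout.
sample_posts = [
    {"species": ["wolf", "fox"], "general": ["anthro"]},
    {"species": ["wolf"]},
    {"character": ["someone"]},
]
species_counts = count_tags_in_category(sample_posts, "species")
```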
Closing statement
OK, that’s enough for a quick look at the e621 data. I’ll continue looking deeper, I have some neat ideas, and I will make these interactive plots available too ^.^