Python: New CV Project!

Intro

I have to get more heavily into CV, and show some kind of portfolio to help with getting a better job. So I thought “hey, why not make a fursuiter tagger for all those EF pictures and videos you took”. It ticks all the boxes: getting into CV more heavily, a nice bit of practice for scraping (training images), and DB building too (lol, a graph DB for social-network building from tagged photos).

Details, concerns, etc.

I’ll defer the ethical quandaries about dealing with artistic IP by stating that I think it’d be polite to ask permission before attempting the identification (naming) of fursuiters in photos; for the more general problem of training a “fursuiter recogniser”, publicly accessible photos will be harvested (from multiple sources) and used to train the CV algorithms.

So I asked a few furs. TaoruPanda was nice enough to volunteer to participate for identification. Other volunteers are welcome! Off to get photos… TaoruPanda has a Twitter account (common among furries) and a FurAffinity account (ubiquitous among furries), both of which will be scraped.

Twitter can be used for building the general training set of images: tags such as #fursuit and #fursuitfriday, for example, are practically assured to contain a positive image of a fursuiter. Thus, Twitter shall be scraped to build the training set of images.

Aside: Twitter is also interesting due to the associated metadata and text found in the tweets.

Python: Scraping Twitter

You will need a Twitter account to programmatically access their data (through the REST APIs). Further to having an account, you need to enable API access in the developer options (this requires associating a phone number with the Twitter account, used for a verification code). OAuth is used for authentication, which makes logging in from Python easy; wrappers for the API exist, and tweepy and python-twitter are both nice.

tweepy: nice and easy to use, but it oversimplifies and abstracts some API endpoints, which makes it trickier for this need.

python-twitter: used in the code below, mainly because it natively deals with JSON and exposes the full API.
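
To make the setup concrete, here’s a minimal authentication sketch with python-twitter. The keys and tokens are placeholders for whatever your developer options page generates; tweet_mode="extended" asks for full (untruncated) tweet text and media entities.

```python
import twitter

# Placeholder credentials -- substitute the keys/tokens generated in the
# developer options of your own Twitter account.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN_KEY = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"

# python-twitter handles the OAuth signing internally.
api = twitter.Api(
    consumer_key=CONSUMER_KEY,
    consumer_secret=CONSUMER_SECRET,
    access_token_key=ACCESS_TOKEN_KEY,
    access_token_secret=ACCESS_TOKEN_SECRET,
    tweet_mode="extended",
)

# Quick sanity check that the credentials work.
print(api.VerifyCredentials().screen_name)
```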

Decide what data is needed (a fetching sketch follows the list):

  • Tweeted media from a specific user
  • Text, hashtags and metadata for tweeted media
  • Tweets without media (?)
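
As a rough sketch of fetching that data with python-twitter, assuming the authenticated api object from the snippet above; the screen name and hashtag are just examples, not necessarily the real handles:

```python
def media_tweets(api, screen_name, count=200):
    """Return the tweets from a user's timeline that carry media (photos/video)."""
    timeline = api.GetUserTimeline(screen_name=screen_name, count=count)
    return [t for t in timeline if t.media]

def hashtag_tweets(api, hashtag, count=100):
    """Search recent tweets for a hashtag (e.g. '#fursuitfriday') for the general set."""
    return api.GetSearch(term=hashtag, count=count)

# Example: list the hashtags and image URLs attached to one user's media tweets.
for tweet in media_tweets(api, "TaoruPanda"):
    hashtags = [h.text for h in tweet.hashtags]
    urls = [m.media_url_https for m in tweet.media]
    print(tweet.id, hashtags, urls)
```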

This data then needs to be stored for later analysis: media files are stored locally (on computer) as images organised into folders, and tweets as JSON dumps; tweet data is also stored in databases for quick retrieval, sorting, etc. Two principal database types will be used here, SQL- and graph-type: SQL (MySQL) for flat data storage, graph (Neo4j) for network/topological analysis.

Strictly speaking, for the CV project, SQL and tweets with media are all that are required.
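
A rough sketch of that minimal path, reusing media_tweets() and the api object from the sketches above; the output directory, table schema and MySQL credentials are placeholders rather than the final setup:

```python
import os

import requests
import mysql.connector  # assumes the mysql-connector-python package

def save_tweet(tweet, out_dir="tweets"):
    """Dump the raw tweet JSON and download any attached images into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    # python-twitter Status objects can serialise themselves back to JSON.
    with open(os.path.join(out_dir, f"{tweet.id}.json"), "w") as fh:
        fh.write(tweet.AsJsonString())
    for m in tweet.media or []:
        img = requests.get(m.media_url_https, timeout=30)
        fname = os.path.join(out_dir, os.path.basename(m.media_url_https))
        with open(fname, "wb") as fh:
            fh.write(img.content)

def store_tweet_row(cursor, tweet):
    """Insert the flat tweet fields into a (hypothetical) MySQL 'tweets' table."""
    cursor.execute(
        "INSERT IGNORE INTO tweets (id, screen_name, text, created_at) "
        "VALUES (%s, %s, %s, %s)",
        (tweet.id, tweet.user.screen_name,
         tweet.full_text or tweet.text, tweet.created_at),
    )

# Placeholder local database and credentials.
conn = mysql.connector.connect(user="scraper", password="...", database="fursuit_cv")
cur = conn.cursor()
for tweet in media_tweets(api, "TaoruPanda"):
    save_tweet(tweet)
    store_tweet_row(cur, tweet)
conn.commit()
```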

<<code to be posted later in entirety>>

Once a sufficiently large collection of images has been made (both with and without fursuiters), image processing can start. Initially this will involve manual tagging of suiters in images, until a sufficient number of known-suiter images is built up to train the CV algorithms.
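
For the manual tagging step, something as simple as the following OpenCV loop would do, assuming the scraped images sit in a local folder; the directory names and key bindings are made up for the sketch:

```python
import glob
import os
import shutil

import cv2  # assumes opencv-python is installed

SRC_DIR = "tweets"  # where the scraped images were saved (placeholder)
LABEL_DIRS = {ord("s"): "suiter", ord("n"): "not_suiter"}  # hypothetical labels

for label in LABEL_DIRS.values():
    os.makedirs(label, exist_ok=True)

# Show each image; press 's' for fursuiter, 'n' for no fursuiter, 'q' to quit.
for path in glob.glob(os.path.join(SRC_DIR, "*.jpg")):
    img = cv2.imread(path)
    if img is None:
        continue
    cv2.imshow("tag me", img)
    key = cv2.waitKey(0) & 0xFF
    if key == ord("q"):
        break
    if key in LABEL_DIRS:
        shutil.move(path, os.path.join(LABEL_DIRS[key], os.path.basename(path)))

cv2.destroyAllWindows()
```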

Stay tuned next time for: the dirty nitty-gritty details, prelim results, problems encountered… or something like that.