While I was at Wikimania last week I was talking to a sociologist who is researching open source contributions. Turned out he’d never heard of Ohloh.net so I was glad to be able to introduce him to it. Ohloh, if you’re not familiar, is a site that reports on contributions to a wide range of open source projects over time, by scraping information from version control repositories. It has over 300k projects listed and almost 400k contributors.
Yesterday, when looking at Ohloh, I wondered whether we could guess anything about the gender of contributors from their user profiles there. So I set up a little experiment. Using the Ohloh API, I extracted a bunch of account data, then grabbed a small sample (100 accounts) to mess around with. (I didn’t worry too much about real randomness at this point, as it was just a proof of concept.)
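If you want to do something similar, the extraction step might look roughly like the sketch below. It's only a sketch: the accounts.xml collection endpoint, the api_key parameter, and the <name> element are my assumptions about how the Ohloh API is laid out, and the key is a placeholder.

    #!/usr/bin/perl
    # Rough sketch of pulling account names from the Ohloh API and taking a
    # small sample. The accounts.xml endpoint, api_key parameter, and <name>
    # element are assumptions about the API rather than details from the post.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use List::Util qw(shuffle);

    my $api_key = 'YOUR_API_KEY';    # placeholder
    my @accounts;

    for my $page (1 .. 20) {         # pull a few pages of accounts
        my $xml = get("http://www.ohloh.net/accounts.xml?api_key=$api_key&page=$page")
            or last;
        push @accounts, $xml =~ m{<name>(.*?)</name>}g;   # crude XML scrape
    }

    # small, not-particularly-random sample of 100 accounts
    my @sample = (shuffle @accounts)[0 .. 99];
    print "$_\n" for @sample;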
Next I created a Mechanical Turk job where I asked participants to look at Ohloh profile pages and see if they could figure out the gender of the user from the username, avatar, or any other means. I had three people look at each profile, paying 5c per judgement, so each profile cost 15c and the whole batch came to 100 * 15c = $15; Amazon's fee brought it to $16.50.
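Feeding the Turk job is just a matter of turning the sample into a batch-input CSV whose column name matches the variable in the HIT template. Something along these lines would do it; the "profile_url" column name and the profile URL format are just illustrative, not how my job was necessarily set up.

    #!/usr/bin/perl
    # Sketch of building the Mechanical Turk batch-input CSV from the sampled
    # account names (read on STDIN). "profile_url" is a hypothetical column
    # name -- it just has to match the variable used in the HIT template.
    use strict;
    use warnings;

    print "profile_url\n";
    while (my $name = <STDIN>) {
        chomp $name;
        print "http://www.ohloh.net/accounts/$name\n";
    }

Run as something like "perl make_batch.pl < sample.txt > turk_batch.csv" and upload the result when creating the batch.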
The results came back in about an hour. I downloaded them and ran them through a quick little Perl script. In any case where at least two of the Mech Turk workers had agreed on “Male” or “Female”, I counted the user as that gender. If the workers couldn’t agree or couldn’t tell, I counted the user as “unknown”.
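The script was roughly of this shape. Again, this is a sketch rather than the original: the "Input.profile_url" and "Answer.gender" column names depend entirely on how the HIT was set up, so adjust to match the actual results file.

    #!/usr/bin/perl
    # Sketch of the 2-out-of-3 tallying described above. Column names
    # ("Input.profile_url", "Answer.gender") are assumptions about the HIT
    # setup; adjust to match the real Turk results CSV.
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1 });
    open my $fh, '<', 'turk_results.csv' or die $!;

    my $header = $csv->getline($fh);
    my %col;
    @col{ @$header } = 0 .. $#$header;

    my %votes;    # profile => { M => count, F => count, '?' => count }
    while (my $row = $csv->getline($fh)) {
        my $profile = $row->[ $col{'Input.profile_url'} ];
        my $answer  = $row->[ $col{'Answer.gender'} ];
        $votes{$profile}{$answer}++;
    }

    my %tally;
    for my $profile (sort keys %votes) {
        my $v = $votes{$profile};
        my $gender = 'unknown';                       # default: no agreement
        $gender = 'male'   if ($v->{M} || 0) >= 2;    # at least 2 of 3 said M
        $gender = 'female' if ($v->{F} || 0) >= 2;    # at least 2 of 3 said F
        $tally{$gender}++;
        print "$profile\t$gender\n";
    }
    print "$_: $tally{$_}\n" for sort keys %tally;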
My results for the test batch of 100 users: 23 classified as male, 6 as female, and 71 as unknown.
Turns out it’s hard to tell gender from Ohloh profiles! Some of them are truly impossible: usernames that are just initials, for instance, and profile pages not filled out at all. And sometimes my MT workers just seemed to have odd opinions, or didn’t know much about names. For example, they all marked someone named Didier Durand as “?”, although Didier is a common French masculine name. Similarly, someone named Pavel Shiryaev also came through as unknown, with two “?” and one “M”, though Pavel is the Russian form of the masculine name “Paul”. dianelamb320 got one vote each for “M”, “F”, and “?”, also resulting in “unknown”, though I would have guessed female. On the other hand, svpavani came through as female (two “F”, one “?”) and I can’t for the life of me figure out why, as there is nothing on the profile page to indicate it.
So… with 71% unknown, I don’t really feel this was successful enough to extend to a wider sample, given that it costs real money to do so. But I do think it was interesting that in the small and not-particularly-random sample I used, 5% had clearly feminine usernames (that is, the 6% classified as female, minus “svpavani”). This is considerably higher than the 1.5% of female contributors usually cited from the FLOSSPOLS survey.
What do you think? Would it be worth trying again with a larger sample? Do you have any ideas for how to get fewer “unknown” responses without compromising the data? Any other ideas on how we could mine Ohloh’s account information to learn things about gender?