Gender and

While I was at Wikimania last week I was talking to a sociologist who is researching open source contributions. Turned out he’d never heard of Ohloh, so I was glad to be able to introduce him to it. Ohloh, if you’re not familiar, is a site that reports on contributions to a wide range of open source projects over time, by scraping information from version control repositories. It has over 300k projects listed and almost 400k contributors.

Yesterday, when looking at Ohloh, I wondered whether we could guess anything about the gender of contributors from their user profiles there. So I set up a little experiment. Using the Ohloh API, I extracted a bunch of account data, then grabbed a small sample (100 accounts) to mess around with. (I didn’t worry too much about real randomness at this point, as it was just a proof of concept.)
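For anyone repeating this with a properly random sample, here is a minimal Python sketch of the sampling step. It assumes the account IDs have already been extracted into a list; the `account_ids` values and the `draw_sample` helper are made up for illustration and are not part of the Ohloh API.

```python
import random


def draw_sample(account_ids, k=100, seed=None):
    """Draw k distinct accounts uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(account_ids, k)


# Hypothetical placeholder IDs standing in for the extracted account data.
account_ids = list(range(1, 5001))
sample = draw_sample(account_ids, k=100, seed=42)
print(len(sample), "accounts sampled")
```

Fixing the seed means the same 100 accounts can be drawn again later, which helps if the Mechanical Turk job needs to be rerun on the same profiles.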

Next I created a Mechanical Turk job where I asked participants to look at Ohloh profile pages and see if they could figure out the gender of the user based on username, avatar, or any other means. I got three people to look at each profile, paying 5c each, so the cost to me was 100 * 15c = $15, plus Amazon’s fee brought it to $16.50.

The results came back in about an hour. I downloaded them and ran them through a quick little Perl script. In any case where at least two of the Mech Turk workers had agreed on “Male” or “Female”, I counted the user as that gender. If the workers couldn’t agree or couldn’t tell, I counted the user as “unknown”.
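The counting rule is simple enough to sketch. The original was a Perl script; what follows is a rough Python equivalent, not the script itself. The vote labels "M", "F" and "?" match the ones used in this post, but the data layout, the order of the votes, and the `consensus` helper are assumptions made for illustration.

```python
from collections import Counter


def consensus(votes, min_agree=2):
    """Return 'male' or 'female' if at least min_agree workers agree, else 'unknown'."""
    counts = Counter(votes)
    for label, gender in (("M", "male"), ("F", "female")):
        if counts[label] >= min_agree:
            return gender
    return "unknown"


# Three votes per profile, using the examples discussed in this post
# (the order of the votes within each list is assumed).
votes_by_user = {
    "Didier Durand": ["?", "?", "?"],
    "Pavel Shiryaev": ["?", "?", "M"],
    "dianelamb320": ["M", "F", "?"],
    "svpavani": ["F", "F", "?"],
}

tally = Counter(consensus(votes) for votes in votes_by_user.values())
for gender in ("female", "male", "unknown"):
    print(tally[gender], gender)
```

With three workers per profile, "M" and "F" can never both reach two votes, so the order in which they are checked does not matter.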

My results for the test batch of 100 users:

6 female
23 male
71 unknown

Turns out it’s hard to tell gender from Ohloh profiles! Some of them are truly impossible: usernames that are just initials, for instance, and profile pages not filled out at all. And sometimes my MT workers just seemed to have odd opinions, or didn’t know much about names. For example, they all marked someone named Didier Durand as “?”, although Didier is a common French masculine name. Similarly, someone named Pavel Shiryaev also came through as unknown, with two “?” and one “M”, though Pavel is the Russian version of the masculine name “Paul”. dianelamb320 got one vote each for “M”, “F”, and “?”, also resulting in “unknown”, though I would have guessed female. On the other hand, svpavani came through as female (two “F”, one “?”) and I can’t for the life of me figure out why, as there is nothing on the profile page to indicate it.

So… with 71% unknown, I don’t really feel this was successful enough to extend to a wider sample, given that it costs real money to do so. But I do think it was interesting that in the small and not-particularly-random sample I used, 5% were clearly feminine usernames (that is, 6% minus “svpavani”). This is considerably higher than the 1.5% of female contributors usually cited from the FLOSSPOLS survey.

What do you think? Would it be worth trying again with a larger sample? Do you have any ideas for how to get fewer “unknown” responses without compromising the data? Any other ideas on how we could mine Ohloh’s account information to learn things about gender?

17 thoughts on “Gender and

  1. Rick

    How many of the usernames fell into the class of “truly inscrutable” and how many into the class of “discernible with some additional investigation”? If it’s more the latter, perhaps we could crowdsource this out to the participants here on the Geek Feminism project? We surely wouldn’t be able to get through a huge swath of Ohloh’s user base, but with a random sample, we might have enough results to be interesting.

  2. Liz

    As we talked about on IM I do think it would be interesting and worth it to suck all the profile names out of there and run them against the biggest gendered name DBs we can get our hands on. I’d try it first with the subset of names that are clearly demarcated as first and last names, as well as by treating the entire profile name as a first name. Then, any first names that show as male and female, or don’t show up, count as uncertain; ones that are female but not male count as female. I think that would give a fairly large sample.
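The name-matching rule described in this comment could be sketched roughly as follows in Python. The two name sets are tiny made-up stand-ins for real gendered-name databases, `classify` is a hypothetical helper, and treating "male but not female" as male is an assumption that mirrors the stated rule for female names.

```python
# Toy stand-ins for real gendered-name databases (hypothetical contents).
FEMALE_NAMES = {"diane", "ana", "mary", "pat"}
MALE_NAMES = {"didier", "pavel", "sean", "pat"}


def classify(profile_name):
    """Classify a profile name as 'female', 'male' or 'uncertain'."""
    parts = profile_name.strip().lower().split()
    if not parts:
        return "uncertain"
    # With a clear "First Last" split, use the first part; otherwise the
    # whole profile name is treated as a first name.
    first = parts[0]
    in_female = first in FEMALE_NAMES
    in_male = first in MALE_NAMES
    if in_female and not in_male:
        return "female"
    if in_male and not in_female:  # assumed mirror of the female rule
        return "male"
    return "uncertain"  # on both lists, or on neither


for name in ("Didier Durand", "ana", "Pat Smith", "dianelamb320"):
    print(name, "->", classify(name))
```

A run-together username like dianelamb320 stays uncertain under this rule, since the whole string is looked up as a single first name.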

  3. Sean

    Seems like it would be really interesting data to have, but rather tricky to do it manually (mechanical-turkishly?). Limiting it to only users who have at least one contribution might help make the size of the dataset more manageable (and remove accidental duplicates).

    It’s also a pretty tricky call from a profile, like your data and blog-post suggest. As an additional example, my Ohloh profile has a woman in the icon and yet I’m male.

    Honestly, it might be easiest to just ask the Ohloh guys to roll in some optional age/gender demographics into the profiles so that aggregate stats can be charted even if those things are never displayed on individual profiles (since it doesn’t seem to make too much sense to display them).

    Interesting project idea, though. If you do get a solid estimate, that’d be some really interesting info!

    1. Skud Post author

      Ooh, that looks useful in a non-US-specific way! I like how they are crowdsourcing information for things they didn’t know about first time round.

  4. Mary

    Re using Mechanical Turk, there’s been research, meta-research I guess, on doing effective research with it. Having read some of that, one of the reasons you get ‘unknown’ even when the answer seems obvious is likely that answering ‘unknown’ is faster than looking at the profile and coming to some kind of decision.

    Typically this problem is avoided by making all answers equally hard to give, such as requiring them to explain their answers or provide evidence. Of course, you then may have to pay more per answer.

    1. Asad

      The interwebs meet real life again. I recently ran a named-entity detection experiment under exactly the same conditions as Skud via Mechanical Turk. I had paid undergrad programmer sidekicks compare the results to a directly-hired human annotator and they found that majority-vote was a reasonable substitute for human annotation via Cohen’s kappa. I suspect that the…ethnography of the names was the main problem in Skud’s task, not annotator laziness. FWIW I’ve spent a lot of time around francophones and never heard of “Didier”, but then maybe that name didn’t transfer over to Québec.

  5. Jonquil

    Offtopic, but interesting: a blogger complains about the underrepresentation of women in links from the big-name bloggers.

    The interesting twist? She’s a woman blogging about the Bible, and she’s facing the same old same old arguments, only worse because people can shut her down immediately by just saying the word “feminist”.

    1. Cesy

      And I suspect, like the recent similar discussion about women blogging about politics in Australia, part of the issue is that women blogging about theology are similarly invisible to the male blogosphere. I have at least a few on my reading list, either on LJ or DW.

  6. Ana

    I am not surprised. I use “ana” as my nickname on IRC (it is also my username on Ohloh, FWIW), and from time to time I get asked if I am female. I tend to ignore the question since it is irrelevant.
    In the “real name” field my first name and surname are set, but it still makes me wonder what to think…
