The role of English Wikipedia’s top content creators in perpetuating gender bias

Top Wikipedia editors compared to the rest of the community, including bots

Source: Wikipedia Signpost, January 22, 2014 URL:

Guest post by Laura Hale

According to the Wikipedia user “Ktr101”, the top 5,000 article creators on English Wikipedia have created 60% of all articles on the project. The top 1,000 article creators alone account for 42% of all Wikipedia articles.

Wikipedia has a well-known gender gap when it comes to contributors who are editing and writing articles.

“Ktr101” made the connection between the two issues in their piece for the community-written Wikipedia newspaper Signpost, saying:

“With the already low numbers of females on the site, this means that there will be more coverage of male-oriented topics. If an article is not covered immediately, there is a good chance that it will be created in the coming years. Unfortunately, this means that whatever female-oriented topics are out there will probably get further neglected, as there is less of a chance that someone will even know that the subject exists, never mind it being notable enough for an article (when in doubt, go for it). The amount of these super page creators only exacerbates the problem, as it means that the users who are mass-creating pages are probably not doing neglected topics, and this tilts our coverage disproportionately towards male-oriented topics.”

This does bring up the question: How bad is the gender gap in terms of article creation by Wikipedia’s top content creators? Are “super users” exacerbating the problem by overwhelmingly creating new articles about men and not creating large numbers of articles about women?

The easy answer to that question is to get the percentage breakdown by gender for all of the articles created by English Wikipedia’s top 5,000 editors. This is easier said than done for a number of reasons. The first is the difficulty of labeling articles as male, female or neutral. Some of this will be inherently subjective. Some of it might actually require content analysis, because an article about, say, “Netball in Jamaica” could have been primarily written by someone interested in the men’s game despite the sport being historically female. In that case, the assumption is turned on its head and simply coding the article as female could be wrong. If the coding is done just from a list of article titles, it requires a lot of knowledge about names and verifying gender facts. Lindsay is one of those unisex names that can be male or female. If a person is writing mostly about Australians, the name is probably going to be male. If a person is writing about U.S. Americans, then it will probably be female. Again, cultural knowledge or authentication by viewing the article is needed. Then there is the purely subjective stuff: Should “Sex and the City” be female, should “Futurama” be male or should “West Wing” be gender neutral? Should the Abbott Ministry be male because Tony Abbott is male and most of the ministry is male (and some policies are seen as anti-female), or should it be gender neutral because women are in it and a ministry is not inherently a sexed concept? Such coding is inherently problematic and makes potential replication very difficult, especially since we are not looking at a few articles but at thousands of unique articles. Realistically, any such research may not be replicable.

Despite that, the question is still worth answering and worth considering. I wanted to do this, but given the time constraints imposed by some of the coding issues mentioned above, I was only able to examine the contributions of 20 of the top 5,000 contributors. This sample represents only 0.4% of the people on that list. To give an idea of how the sample compares with the full list: for the 20 sampled article creators, the mode number of articles created was 101, the median was 108 and the average was 4,009. Across all 5,000 article creators, this is not quite a match. For the 5,000, the mode was 107, the median was 205 and the average was 536. The quartiles for the sampled population are 101, 108, 1,315.5 and 40,016. For all 5,000 contributors, they are 135, 205, 400.5 and 94,756. In all, 80,196 articles were included in this sample. It is not an exactly representative sample, but for my purposes of trying to begin to understand patterns and hoping to encourage others to continue this research, it is good enough.

For my purposes, women’s articles are defined as biographies about women, articles about groups of women, things heavily featuring women, articles about fictional women, or articles that almost entirely discuss only women. Examples: Hillary Clinton, Canberra Capitals, The Good Wife, Lisa Simpson, and African-American women in politics. The same applied for articles about men. Neutral gender articles were articles that did not fit into these categories.

Using these criteria, 1,412 articles were identified as female, 4,595 were identified as male, and 74,189 were identified as gender neutral. On the face of it, woot, woot. Ignoring the gender-neutral articles, 23.5% of all articles were about women. This certainly beats the estimated contributor gender gap. Except the data suggests this picture is not true of “super users” as a group when it comes to creating articles about women. Of the 20 people in the sample, 5 did not write an article that was gendered either way. Four people wrote zero articles about women but did write articles about men. That puts it at 45% of the sampled contributors not writing about women (5 of them not writing about men either), and of the 15 people who wrote a gendered article, about 27% did not write about women. This is where a bigger sample size would probably come in handy, but it is still a bit depressing.
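For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the shares from the counts reported above. The variable names and the pct helper are my own illustrative choices, not part of the original analysis, and the rounding to one decimal place is an assumption.

```python
# Counts reported in the post for the 20 sampled contributors.
female_articles = 1_412
male_articles = 4_595
neutral_articles = 74_189

total_articles = female_articles + male_articles + neutral_articles
gendered_articles = female_articles + male_articles


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal (assumed rounding)."""
    return round(100 * part / whole, 1)


# Share of articles about women, with and without the gender-neutral articles.
share_female_of_gendered = pct(female_articles, gendered_articles)
share_female_of_all = pct(female_articles, total_articles)

# Contributor-level counts reported in the post.
sampled_contributors = 20
wrote_no_gendered_articles = 5   # nothing coded male or female
wrote_about_men_only = 4         # gendered articles, none about women

wrote_gendered = sampled_contributors - wrote_no_gendered_articles

# Contributors with zero articles about women, as a share of the whole sample
# and as a share of those who wrote at least one gendered article.
no_women_all = pct(wrote_no_gendered_articles + wrote_about_men_only,
                   sampled_contributors)
no_women_gendered = pct(wrote_about_men_only, wrote_gendered)

print(f"Articles in sample: {total_articles}")
print(f"About women, gendered articles only: {share_female_of_gendered}%")
print(f"About women, all articles: {share_female_of_all}%")
print(f"Contributors with no articles about women: {no_women_all}%")
print(f"Same, among gendered-article writers: {no_women_gendered}%")
```

The two denominators matter here: the share of articles about women looks very different depending on whether the gender-neutral articles are counted, and the same goes for contributors.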

When looking at gendered article writers only for their gendered content, only one contributor was at 50% of their articles being about women. The next closest created 38% of their articles about women. The third was at 25%. The fourth was at 19% and the fifth was at 13%. That rounds out the top 25% of creators of content about women. The remaining 75% (including our non-gendered writers) average 2.6% of their content about women. The remaining contributors who wrote about gendered topics average 4.7% of their content about women.

English Wikipedia’s “super users” are not contributing much to content focusing on women. This is problematic on multiple levels. The first is return on investment (ROI). A lot of money is currently being spent on encouraging new contributors to come to the project and write articles about women. There are edit-a-thons and training sessions and wiki stormings. All of these cost volunteer hours and money. Research shows that edit-a-thons are not actually very cost-effective in terms of generating new content and developing a new cohort of users. A lot of times, articles developed at these events get deleted or nominated for deletion within seconds of going live. The cost of creating a cohort of new users to fix the gap is very high, and the return on that investment is low.

That isn’t to say that women should not be recruited and should not be encouraged to add articles about women to Wikipedia.  They absolutely should. On some level, the more this editing is normalized, the better.

It just is not a cost- and time-effective solution to fixing the representation gap for women on Wikipedia. The best option is to encourage the top 5,000 editors to create articles about women and to incentivize this group. The sheer volume of articles they have created indicates they have a good understanding of what makes a person or topic notable for the purposes of being eligible for an article. They do not need to learn the interface because they probably mastered it on their way to creating these articles. They have accumulated a reputation that, for a number of them, makes their articles much less likely to be deleted. The group is clearly passionate about Wikipedia, enough to create a large number of articles. The cost of getting them to switch over to creating content about women is probably much lower.

The second problem, once return on investment is out of the way, is one Ktr101 alludes to: If top content creators continue with their current contribution patterns, the underrepresentation of women is likely to get worse, not better. If one assumes that only 0.1% (including non-gendered) or 8.9% (excluding non-gendered) of newly created articles are about women, it means that the remaining non-“super users”, who have created only 40% of the existing articles, need to fill the gap. And existing research on Wikipedia editor recruitment and retention suggests this is just not a feasible solution. Despite all the efforts to recruit and retain editors, it just isn’t happening. More and more articles are being created by “super users” and there is no growth pattern that suggests relying on new users is feasible.

The third issue is that relying almost exclusively on new contributors to create new content about women as a way of offsetting the gender imbalance does nothing to address perception problems related to Wikipedia being male and cliquey. Using business jargon, the Wikimedia Foundation provides a service: free knowledge for public consumption. The service has stakeholders, a key group of which are the elite content creators. The “super users” in this elite content-creating group provide 60% of Wikipedia’s content. They provide most of the material for public consumption for another of Wikimedia’s key stakeholder groups, colloquially known as readers. In this area, the two groups of key Wikimedia stakeholders are actually acting counter to the goal of the Wikimedia Foundation, because one group is not providing information that the other wants. Worse yet, the behaviors of one group (or at least the perception of their behaviors) hurt the ability of the Wikimedia Foundation to grow its readership and to grow another stakeholder group, regular and new contributors. One of the ways to offset the gender imbalance that creates this perception problem and lack of information is to change not reader desires but the behavior of the “super users” who are perceived as “being” Wikipedia. And after these “super users” create the articles about women, highlight them and talk up their work.

English Wikipedia’s top content creators play a role in perpetuating gender bias on the project. More research should be done on the project to understand what this means for the broader gender gap.

Laura Hale is a Ph.D. student at the University of Canberra, who is studying sport and social media. As a Wikipedian, she has created over 1,200 articles with over 40 percent of them about women.  She has served as a Wikipedian in Residence for the Australian Paralympic Committee and the Spanish Paralympic Committee.  She is also active in a leadership role in the Wikimedia movement, having served as the vice president of Wikimedia Australia, and the provisional chairperson of The Wikinewsie Group.

