Data Visualization for UN Global Pulse
Sep 2016
"Featured on UN Global Pulse blog. Can big-data sources be used to inform policy making?"
PROJECT STATS
  • Team size: 1
  • My role: Visualization design and data wrangling
  • Timeline: Sep 2016
  • Methods used: Data analysis, visualization design, information architecture
  • Client: UN Global Pulse

Project's vision

UN Global Pulse wanted to evaluate whether big data sources like Wikipedia could be used to inform policy decisions.
My contribution was an exploratory data analysis of Wikipedia edit data and usage statistics to help identify topics with high traffic and conflict. One of the work items was to visualize a Wikipedia edit war so that policy makers could understand the level of conflict and identify the key actors in the contributions to and upkeep of a page.

Visualizing a Wikipedia edit war

In the above edit, the user Glassleila corrects two words added by 203.220.72.109. What you see is a Wikipedia edit diff between two consecutive commits.

An edit war occurs when editors who disagree about the content of a page repeatedly override each other's contributions.

I used a Python Wikimedia library called Pywikibot to scrape existing edit data and compare consecutive edits to build a weighted adjacency matrix. The adjacency matrix was weighted by a simplistic rubric of insert strength and delete strength: how many characters of data the new edit inserted and how many characters it deleted from the old edit.
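The two weights can be sketched as follows. This is a minimal illustration and not the original script: it assumes the two revision texts have already been fetched, and it uses Python's standard `difflib` as a stand-in for whatever diffing the crawler actually did. The function name `edit_strengths` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def edit_strengths(old_text: str, new_text: str) -> Tuple[int, int]:
    """Return (insert_strength, delete_strength) in characters.

    Assumption: strengths are plain character counts taken from a
    character-level diff between two consecutive revisions.
    """
    inserted = 0
    deleted = 0
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" removes old characters and adds new ones, so it counts toward both.
        if tag in ("insert", "replace"):
            inserted += j2 - j1
        if tag in ("delete", "replace"):
            deleted += i2 - i1
    return inserted, deleted


print(edit_strengths("abc", "abcdef"))  # (3, 0): three characters appended, none removed
```

A character-level diff keeps the rubric as simple as described above; a word-level or sentence-level diff would be an equally valid choice and would give different magnitudes.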

Exploring visualization constructs

There are several existing visual constructs for representing interactions between multiple parties.

A visualization of Uber rides by Mike Bostock.

Candidates for visualization constructs

The winning visual construct

Crawling Wikipedia and analysing data

The Python program processes consecutive edits to calculate the difference between them. The result of the crawling and data processing is an adjacency matrix that also carries weight data.

When user adds to previous edit

Suppose User 1 adds 90 characters on top of the previous edit. This data does not overwrite the contents of the previous edit, so it is counted as insert strength and added to the [User 1, User 1] position of the adjacency matrix.
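As a rough sketch of that bookkeeping, the matrix can be held as a nested dictionary keyed by user name. The names `matrix` and `record_insert` are hypothetical and only illustrate the idea; the original program may have stored the matrix differently.

```python
from collections import defaultdict

# matrix[row_user][col_user] -> accumulated character count (assumed layout)
matrix = defaultdict(lambda: defaultdict(int))


def record_insert(matrix, editor, inserted_chars):
    """Add insert strength to the editor's own diagonal cell."""
    matrix[editor][editor] += inserted_chars


# The example from the text: User 1 adds 90 characters.
record_insert(matrix, "User 1", 90)
print(matrix["User 1"]["User 1"])  # 90
```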

When user deletes from a previous edit

Suppose User 1 deletes 30 characters from the edit made by User 2. In the next edit, User 2 deletes 4 characters from the previous edit. These deletions are recorded as delete strength at [User 1, User 2] and [User 2, User 1]. This does not preserve the directionality of the edits, since the goal is to identify edit wars.
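One possible reading of this rule, sketched below, is that every deletion is added to both cells for the pair of users, which keeps the matrix symmetric. That reading is an assumption, as are the names `record_delete` and `matrix` and the choice to skip deletions a user makes to their own previous edit.

```python
from collections import defaultdict

# matrix[row_user][col_user] -> accumulated character count (assumed layout)
matrix = defaultdict(lambda: defaultdict(int))


def record_delete(matrix, editor, previous_editor, deleted_chars):
    """Add delete strength to both cells for the pair, ignoring direction."""
    if editor == previous_editor:
        # Assumption: deleting from your own previous edit is not an interaction.
        return
    matrix[editor][previous_editor] += deleted_chars
    matrix[previous_editor][editor] += deleted_chars


# The example from the text: User 1 deletes 30 characters from User 2's edit,
# then User 2 deletes 4 characters from User 1's edit.
record_delete(matrix, "User 1", "User 2", 30)
record_delete(matrix, "User 2", "User 1", 4)
print(matrix["User 1"]["User 2"], matrix["User 2"]["User 1"])  # 34 34
```

Writing each deletion to a single cell, for example only `matrix[editor][previous_editor]`, would preserve who deleted from whom, at the cost of a more complex diagram.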

How to read this

Edits made on the English Wikipedia child marriage page, Jul 2010 - Jul 2011

In the above diagram, for the highlighted arc for 79.173.228.159, the inbound curve represents the data added across all edits made by this user. The user has also had major edit interactions with ClueBot NG, and seems to be the only user whose edits have been acted on by ClueBot NG. Edit wars can be identified as thick and equally distributed chord connections between two entities. There are no visibly identifiable edit wars in the above diagram.

Other work

Other work included making scatter plots of Wikipedia visit data and translating Hindi and Marathi keywords for Twitter analysis. The volunteering stint concluded with a written report detailing the feasibility of using Wikipedia as a big data source.

Potential and challenges

A continued challenge is representing the directionality of the edits along with their magnitude without complicating the visual representation. Currently the diagram serves well for identifying the level of interaction and the edit wars on a page. However, it is not possible to know whether the edits were made by user A on B or vice versa. Since time also plays an important role in identifying an edit war, a better visualization could incorporate that dimension as well.