UN Global Pulse wanted to evaluate whether big data sources like Wikipedia could be used to inform policy decisions.
My contribution was an exploratory data analysis of Wikipedia edit data and usage statistics to help identify topics with high traffic and conflict. One of the work items was to visualize a Wikipedia edit war so that policy makers could understand the level of conflict and identify the key actors in the contributions to and upkeep of a page.
Visualizing a Wikipedia edit war
In the above edit, the user Glassleila corrects two words added by 220.127.116.11. What you see is a Wikipedia edit diff between two consecutive commits.
An edit war occurs when editors who disagree about the content of a page repeatedly override each other's contributions.
I used a Python Wikimedia library called Pywikibot to scrape existing edit data, then compared consecutive edits to build a weighted adjacency matrix. The matrix was weighted by a simple rubric of insert strength and delete strength: how many characters the new edit inserted, and how many characters it deleted from the old edit.
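The insert and delete strength rubric can be sketched as follows. This is a minimal illustration, not the original program: it assumes the two revision texts are already available as strings (the Pywikibot scraping step is left out), and it uses the standard library's difflib for the character-level comparison, which the original may not have used. The function name edit_strengths is hypothetical.

```python
from difflib import SequenceMatcher


def edit_strengths(old_text, new_text):
    """Return (insert_strength, delete_strength) in characters.

    Assumption: strengths are counted from a character-level diff
    between two consecutive revision texts.
    """
    # autojunk=False keeps difflib from ignoring frequent characters
    # in long texts, which would distort the counts.
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    inserted = 0
    deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" both removes old characters and adds new ones.
        if tag in ("insert", "replace"):
            inserted += j2 - j1
        if tag in ("delete", "replace"):
            deleted += i2 - i1
    return inserted, deleted
```

A pure addition yields only insert strength and a pure removal yields only delete strength. A rewritten passage contributes to both.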
Exploring visualization constructs
Several existing visual constructs can represent interactions between multiple parties.
A visualization of Uber rides by Mike Bostock.
Candidates for visualization constructs
- Force Directed Graph : This is a network/node diagram with responsive, movable nodes. It can represent connections between nodes, but representing the strength and nature of these connections is harder, and the result can be hard to read.
- Sunburst Diagram : This is a more sophisticated form of a pie chart with subsections projecting outwards. It cannot intuitively represent relationships between sections.
- Heatmap Matrix : This is a very direct representation of the raw weighted adjacency data. Reading heat-encoded data may not be the friendliest experience for people, since subtle differences in data are often blended into similar colors.
The winning visual construct
- Chord diagram : This is a diagram where a circular whole is divided into sections, and each section is connected to the others, where applicable, by a chord. The thickness of a chord represents the magnitude of the connection. This was chosen as the form of visualization because it effectively represents the nature of inter-party interaction by encoding the magnitude of each interaction in the thickness of the chord. In-bound interactions by a party can also be represented as an inbound Bézier curve instead of a connecting chord. The diagram is also very easy to read, because parties of interest can be highlighted by hovering over the arc that represents the party. The above image of Uber rides is an example of a chord diagram.
Crawling Wikipedia and analysing data
The Python program processes consecutive edits to calculate the difference between them. The result of the crawling and data processing is an adjacency matrix that also carries weight data.
When user adds to previous edit
Suppose User 1 adds 90 characters to the previous edit. This data does not overwrite the contents of the previous edit, so it is counted as insert strength and added to the [User 1, User 1] position of the adjacency matrix.
When user deletes from a previous edit
Suppose User 1 deletes 30 characters from the edit made by User 2. In the next edit, User 2 deletes 4 characters from the previous edit. This is recorded as delete strength at [User 1, User 2] and [User 2, User 1]. The directionality of the edits is not preserved, since the goal is to identify edit wars.
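The two cases above can be sketched as a small matrix builder. This is an illustration under stated assumptions, not the original program: revisions arrive as (user, text) pairs in chronological order, deleted characters are attributed to the author of the immediately preceding revision, delete strength is added symmetrically to both positions (one reading of "does not preserve the directionality"), and deletions of a user's own previous edit are ignored. The names edit_strengths and build_matrix are hypothetical, and the diff uses the standard library's difflib.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def edit_strengths(old_text, new_text):
    """Return (inserted, deleted) character counts between two texts."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    inserted = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            inserted += j2 - j1
        if tag in ("delete", "replace"):
            deleted += i2 - i1
    return inserted, deleted


def build_matrix(revisions):
    """Build a weighted adjacency matrix from (user, text) revisions."""
    matrix = defaultdict(lambda: defaultdict(int))
    prev_user, prev_text = None, ""
    for user, text in revisions:
        inserted, deleted = edit_strengths(prev_text, text)
        # Insert strength goes on the diagonal: [user, user].
        matrix[user][user] += inserted
        # Delete strength is shared by both users, losing direction.
        # Assumption: self-deletions are skipped.
        if prev_user is not None and prev_user != user and deleted:
            matrix[user][prev_user] += deleted
            matrix[prev_user][user] += deleted
        prev_user, prev_text = user, text
    return matrix


# Made-up revision history mirroring the examples in the text.
history = [
    ("User 2", "a" * 100),             # initial text by User 2
    ("User 1", "a" * 70),              # User 1 deletes 30 characters
    ("User 2", "a" * 66),              # User 2 deletes 4 characters
    ("User 1", "a" * 66 + "b" * 90),   # User 1 adds 90 characters
]
matrix = build_matrix(history)
```

With this made-up history, the 30 and 4 deleted characters accumulate in both off-diagonal positions, and the 90 added characters land on User 1's diagonal entry.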
How to read this
Edits made on the English Wikipedia child marriage page, Jul 2010 - Jul 2011
In the above diagram, for the highlighted arc for 18.104.22.168, you can see the inbound curve representing the data added by all edits made by this user. The user has also had major edit interactions with ClueBot NG, and seems to be the only user whose edits have been worked on by ClueBot NG. Edit wars can be identified as thick and equally distributed chord connections between two entities. No edit wars are visibly identifiable in the above diagram.
Other work included making scatter plots of Wikipedia visit data and translating Hindi and Marathi language keywords for Twitter analysis. The volunteering stint concluded with a written report detailing the feasibility of using Wikipedia as a big data source.
Potential and challenges
A continued challenge is representing the directionality of the edits along with their magnitude without complicating the visual representation. Currently the diagram serves us well in identifying the level of interaction and edit wars on a page. However, it is not possible to know whether the edits were made by user A on user B or vice versa. Since time also plays an important role in identifying an edit war, a better visualization could incorporate that dimension as well.
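On the data side, one possible way to keep direction is to store delete strength per (editor, target) pair and derive the undirected weight as the sum of both directions. The sketch below is hypothetical and was not part of the project: the function names, the event format, and the thresholds min_total and min_balance are all assumptions. It also shows how the "thick and equally distributed" pattern could be checked numerically once both directions are known.

```python
from collections import defaultdict


def directed_deletes(events):
    """Sum delete strength per direction.

    events: iterable of (editor, target, deleted_chars), meaning the
    editor deleted that many characters of the target's text.
    """
    directed = defaultdict(int)
    for editor, target, deleted in events:
        directed[(editor, target)] += deleted
    return directed


def looks_like_edit_war(directed, a, b, min_total=100, min_balance=0.5):
    """Heuristic: the pair interacts heavily in both directions.

    min_total and min_balance are arbitrary illustrative thresholds.
    """
    a_on_b = directed.get((a, b), 0)
    b_on_a = directed.get((b, a), 0)
    total = a_on_b + b_on_a            # the undirected chord thickness
    if total == 0 or total < min_total:
        return False                   # not a thick connection
    # Balance is 1.0 when both sides delete equally, 0.0 when one-sided.
    balance = min(a_on_b, b_on_a) / max(a_on_b, b_on_a)
    return balance >= min_balance


# Made-up events for illustration only.
events = [("A", "B", 300), ("B", "A", 250), ("A", "C", 400)]
directed = directed_deletes(events)
```

In this made-up data, A and B delete heavily from each other and would be flagged, while the one-sided deletions by A on C would not. Adding a timestamp to each event would be a natural next step for the time dimension.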