Page Protection: Another Missing Dimension of Wikipedia Research
-------------------------------------------------------------------------

| Author: Benjamin Mako Hill and Aaron Shaw
| Homepage: http://communitydata.cc/wiki-protection/
| Archival copy: http://dx.doi.org/10.7910/DVN/P1VECE
| Code: http://projects.mako.cc/source/?p=protection-tools
| License: GNU GPLv3 or any later version
| Description: Tools to generate a protection spells dataset from "raw" MediaWiki XML dumps like those published by the Wikimedia Foundation.

General Information
----------------------

Page protection log data is very uneven and has changed format many
times. For several years before 2008-09, data on the specific rights
granted by a protection was recorded in comment fields, but that
information cannot be parsed reliably because people sometimes added
additional information to the comments (or tried to reproduce the
format "by hand" as well). This code is limited to extracting and
parsing only the most reliable information, which means only protection
data between 2008-09 and the point of data collection.

There are two sources of page protection data:

- A log file that contains log events including protections, moves,
  deletions, blocks, etc.
- A snapshot, taken from the Wikipedia MySQL database, of the pages
  that were protected at the point that the database (and log) was
  exported.

As mentioned above, our log data begins only in 2008-09, which means our
data on protection spells is both right censored (i.e., spells still
ongoing at the point of data collection) and left censored (i.e., spells
that were already ongoing in 2008-09). Because protection data was not
recorded reliably over Wikipedia's history, we believe that avoiding
censoring (either left or right) is not possible given the data sources
the WMF has published.

|---------|----------------|----->
2004      2008             2015
          |--Our Dataset---|

Because we do not have perfect data for the beginning of the window,
events that happened before it are invisible to us unless they were
still ongoing at the point of data collection. Because the dataset we
produce here starts in 2008-09 (the point where the log format
stabilizes), there are two conditions in which data will be missing
(refer to the timeline figure above):

a) If a page is protected before 2008-09 and deleted before the point
   of data collection, we will have no indication that the page was
   protected at all, and no protection spells for that page will be
   included in the dataset.

b) If a page is protected before 2008-09 and its protection status is
   changed at any time during the period we observe (i.e., by adjusting
   the rights, updating the expiration time, or unprotecting the page),
   we will know that the page was protected from the beginning of our
   data until that change, but we will not know the specific rights
   associated with the protection. We will have complete data on the
   protection status of that page from the change onward.

Running the Software
-----------------------------

0. Install dependencies
===========================

- Python 3
- GNU R
- `data.table` R package available on CRAN

There is also a dependency on a file called `wikipedia_namespaces.csv`
which is included in this repository and which is drawn from data on
this page: https://en.wikipedia.org/wiki/Wikipedia:Namespace

This file is taken from English Wikipedia in 2015. If you are working
with different wikis or with an updated dump, you will likely need to
update this file.

1. Download Dumps
==========================

First, download a dump of MediaWiki log actions. WMF distributes these
for English Wikipedia at http://dumps.wikimedia.org.
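For reference, all three of the dump files used in the examples in this
document can be fetched directly with a tool like wget. This is only a
sketch: the URLs below assume the 2015-01-12 English Wikipedia dump used
throughout this document, and you will need to substitute the date (and
wiki) of whatever dump you actually work with:

$ wget https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-pages-logging.xml.gz
$ wget https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page_restrictions.sql.gz
$ wget https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page.sql.gz

The following paragraphs describe each of these files.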
For example, the latest version of the log dump when we wrote this
document was this file:

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-pages-logging.xml.gz

You will also need a table that records the protection status of pages
at the point that the database dump was created. WMF distributes these
from http://dumps.wikimedia.org as well. For example, the latest version
is here:

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page_restrictions.sql.gz

Finally, because the table above maps only page ID numbers to
restriction events, we will also need a mapping of page IDs to page
titles, which is contained in a table like this one (the latest at the
time of writing):

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page.sql.gz

2. Parse SQL Tables
============================

An important first step is parsing the two SQL dump files into CSV
format. We can use `02-mysqldump_to_csv.py` to do this like so:

$ zcat enwiki-20150112-page_restrictions.sql.gz | ./02-mysqldump_to_csv.py \
    > enwiki-20150112-page_restrictions.csv

$ zcat enwiki-20150112-page.sql.gz | ./02-mysqldump_to_csv.py \
    > enwiki-20150112-page.csv

The first file is small, so this should be very quick. The second file
is larger but should still take only several minutes.

The script is adapted from:
https://github.com/jamesmishra/mysqldump-to-csv

3. Parse Log File
=================================

The log file that contains changes over time is much larger and will
take several hours to parse for English Wikipedia, even on a very fast
machine. The page log can be parsed with `03-parse_mw_event_log.py` like
so:

$ zcat enwiki-20150112-pages-logging.xml.gz | ./03-parse_mw_event_log.py

This will produce several TSV files, each containing a different subset
of the log:

output-moves.tsv
output-protection.tsv
output-deletions.tsv

If you pass an XML filename to 03-parse_mw_event_log.py, these files
will not be named "output" but will instead be named based on the root
of that filename.

4. Import Data Into R
=============================

Next, import all of the data we've created into R. You'll need to first
edit the file `04-import_data.R` so that the input files (all defined at
the very top of the script) match the files that you've downloaded and
created. Once you've done that, you can run the R script, which will
load and process them:

$ R --no-save < 04-import_data.R

This will output an RData file called `processed_log_and_sql_data.RData`.

5. Generate Spells
=============================

Finally, we run the command that reads in the prepared data and
generates the spells dataset:

$ R --no-save < 05-generate_spells.R

This will generate the final page protection dataset: `spells.RData`
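As a quick sanity check on the output, you can load the resulting file
into R and list what it contains. This is only a sketch: the name(s) of
the object(s) stored in `spells.RData` are whatever `05-generate_spells.R`
saves, so we list them with `ls()` rather than assuming a particular
name:

$ Rscript -e 'load("spells.RData"); print(ls())'

You can then call `str()` on whichever object(s) are listed, or simply
`load()` the file in an interactive R session and explore it from there.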