Page Protection: Another Missing Dimension of Wikipedia Research
-------------------------------------------------------------------------

| Authors: Benjamin Mako Hill <mako@atdot.cc> and Aaron Shaw <aaron.d.shaw@gmail.com>
| Homepage: http://communitydata.cc/wiki-protection/
| Archival copy: http://dx.doi.org/10.7910/DVN/P1VECE
| Code: http://projects.mako.cc/source/?p=protection-tools
| License: GNU GPLv3 or any later version
| Description:

    Tools to generate a protection spells dataset from "raw" MediaWiki
    XML dumps like those published by the Wikimedia Foundation.

General Information
----------------------

Page protection log data is very uneven, and its format has changed
many times.  For several years before 2008-09, data on the specific
rights were recorded in comment fields, but these fields cannot be
parsed reliably because people sometimes added additional information
(or tried to reproduce the format "by hand").

This code is limited to extracting and parsing only the most reliable
information, which means only protection data from between 2008-09 and
the point of data collection.

There are two sources of page protection data:

- A log file that contains log events including protections, moves,
  deletions, blocks, etc.

- A snapshot from the Wikipedia MySQL database of protected pages at the
  point that the database (and log) was exported.

As mentioned above, our log data begins only in 2008-09, which means we
have data on protection spells that is both right censored (i.e.,
ongoing protection events) and left censored (i.e., protection spells
that were ongoing in 2008-09).  Because protection data was not
recorded reliably over Wikipedia's history, we believe that avoiding
censoring (either left or right) is not technically possible given the
data sources the WMF has published.


  |---------|----------------|----->
2004 <x>  2008     <y>     2015  <z>
            |--Our Dataset---|

Because we don't have perfect data for the beginning of any window,
events that happened before the window are invisible to us unless they
are ongoing at the point of data collection.  When our dataset starts
in 2008 (as it does in the dataset we produce here, where the log
format stabilizes), this means that there are two conditions where
data will be missing (refer to the timeline figure above):

a) If a page is protected in <x> and is deleted in <y>, we will have no
   indication that the page was protected at all, and no page
   protection spells will be included in the database.

b) If a page is protected in <x> and its protection status is changed
   at any time <t> during period <y> (i.e., either by adjusting the
   rights, updating the expiration time, or unprotecting the page), we
   will know that the page was protected during the period from the
   beginning of <y> until <t>, but we will not know the specific rights
   associated with the protection. We will have complete data on the
   protection status of that page from <t> onward.
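
Condition (b) can be made concrete with a minimal Python sketch. It is
an illustration only and is not taken from the scripts in this
repository: the helper `left_censored_spell`, the dictionary keys, and
the action names `protect`, `modify`, and `unprotect` are assumptions,
and the window start date in the example is a placeholder.

```python
from datetime import date
from typing import List, Optional, Tuple

# One (timestamp, action) pair per log event for a single page, sorted by
# time. The action names are assumed for this sketch.
Event = Tuple[date, str]


def left_censored_spell(events: List[Event], window_start: date) -> Optional[dict]:
    """Infer the spell implied by condition (b), if there is one.

    If the first visible event changes or removes a protection, the page
    must already have been protected when the window opened, so we can
    infer a spell from the window start until that event. Its rights are
    unknown because the original protection event is outside the window.
    """
    if not events:
        return None  # nothing is visible in the log for this page
    first_time, first_action = events[0]
    if first_action in ("modify", "unprotect"):
        return {
            "start": window_start,
            "end": first_time,
            "rights": None,  # not recoverable from the log
            "left_censored": True,
        }
    return None  # the first visible event starts a fully observed spell


# Placeholder window start; the real cutoff depends on the log data.
example = left_censored_spell([(date(2010, 5, 1), "unprotect")], date(2008, 9, 1))
print(example)
```

A page that falls under condition (a) never produces any events here,
which is why the function has nothing to return for it.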


Running the Software
-----------------------------

0. Install dependencies
===========================

- Python 3
- GNU R
- `data.table` R package available on CRAN

There is also a dependency on a file called `wikipedia_namespaces.csv`,
which is included in this repository and which is drawn from data on
this page: https://en.wikipedia.org/wiki/Wikipedia:Namespace

This file is taken from English Wikipedia in 2015. If you are working
with different wikis or with an updated dump, you will likely need
to update this file.


1. Download Dumps
==========================

First, download a dump of MediaWiki log actions. WMF distributes these
for English Wikipedia at http://dumps.wikimedia.org. For example, the
latest version when we wrote this document was this file:

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-pages-logging.xml.gz

You will also need a file that includes the status of protected pages
at the point that the database was exported. WMF distributes
these from http://dumps.wikimedia.org as well. For example, the latest
version at the time of writing is here:

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page_restrictions.sql.gz

Finally, because the table above maps only page ID numbers to
restriction events, we'll need a mapping of page IDs to page titles,
which is contained in a table like this one (the latest at the time of
writing):

https://dumps.wikimedia.org/enwiki/20150112/enwiki-20150112-page.sql.gz
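
The three URLs above share one pattern, so the list of files for
another dump can be built with a short Python sketch. The helper
`dump_urls` is hypothetical, and the pattern is only generalized from
the three examples above: other wikis or dump dates may not follow it,
and older dumps may no longer be hosted.

```python
# Base URL and file suffixes copied from the three example URLs above.
BASE = "https://dumps.wikimedia.org"
SUFFIXES = (
    "pages-logging.xml.gz",
    "page_restrictions.sql.gz",
    "page.sql.gz",
)


def dump_urls(wiki, dump_date):
    """Return the three dump URLs for one wiki and dump date (hypothetical helper)."""
    return [
        f"{BASE}/{wiki}/{dump_date}/{wiki}-{dump_date}-{suffix}"
        for suffix in SUFFIXES
    ]


if __name__ == "__main__":
    for url in dump_urls("enwiki", "20150112"):
        print(url)
```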

2. Parse SQL Tables
============================

An important first step is parsing the two SQL dump files into CSV
format. We can use the `02-mysqldump_to_csv.py` script to do this like:

$ zcat enwiki-20150112-page_restrictions.sql.gz | ./02-mysqldump_to_csv.py \
  > enwiki-20150112-page_restrictions.csv

$ zcat enwiki-20150112-page.sql.gz | ./02-mysqldump_to_csv.py \
  > enwiki-20150112-page.csv

The first file is small and should be very quick. The second file is
larger but should still take only several minutes.

The script is adapted from: https://github.com/jamesmishra/mysqldump-to-csv
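
Once both CSV files exist, the page IDs in the restrictions file can be
matched to titles in the page file. The Python sketch below shows the
idea; it is not how `04-import_data.R` does it. The rows are made up,
the helper `join_titles` is hypothetical, and the column positions are
an assumption (page ID first in both files, then restriction type and
level in one file and namespace and title in the other), so check them
against your own dump.

```python
import csv
import io

# Made-up rows standing in for the two CSV files. Assumed column order:
# restrictions: page ID, restriction type, restriction level
# pages:        page ID, namespace number, page title
RESTRICTIONS_CSV = "101,edit,sysop\n101,move,sysop\n202,edit,autoconfirmed\n"
PAGE_CSV = "101,0,Example_page\n202,4,Another_example\n303,0,Unprotected_page\n"


def join_titles(restrictions_file, page_file):
    """Yield (page_id, namespace, title, type, level) for each restriction."""
    # The restrictions file is the small one, so hold it in memory ...
    restrictions = {}
    for row in csv.reader(restrictions_file):
        restrictions.setdefault(row[0], []).append((row[1], row[2]))
    # ... and stream the much larger page file past it.
    for row in csv.reader(page_file):
        for kind, level in restrictions.get(row[0], []):
            yield row[0], row[1], row[2], kind, level


for record in join_titles(io.StringIO(RESTRICTIONS_CSV), io.StringIO(PAGE_CSV)):
    print(record)
```

Keeping the small file in memory and streaming the large one means the
page table never has to be loaded in full.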

3. Parse Log File
=================================

The log file that contains changes over time is much larger and will
take several hours to parse for English Wikipedia even on a very fast
machine.  The log can be parsed using `03-parse_mw_event_log.py` like:

$ zcat enwiki-20150112-pages-logging.xml.gz | ./03-parse_mw_event_log.py

This will produce several TSV files from the log file in several
different formats:

  output-moves.tsv
  output-protection.tsv
  output-deletions.tsv

If you pass an XML filename to `03-parse_mw_event_log.py`, these files
will not be called `output` but something based on the root of the
filename.
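
Because the log is so large, it helps to read it as a stream rather
than loading it whole. The Python sketch below shows one way to do
that with the standard library; it is not the implementation in
`03-parse_mw_event_log.py`. The sample XML is made up, and the element
names (`logitem`, `timestamp`, `type`, `action`, `logtitle`) are an
assumption about the export format that should be checked against a
real dump.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of the logging dump.
SAMPLE = b"""<mediawiki>
  <logitem>
    <id>1</id>
    <timestamp>2010-01-02T03:04:05Z</timestamp>
    <type>protect</type>
    <action>protect</action>
    <logtitle>Example page</logtitle>
  </logitem>
  <logitem>
    <id>2</id>
    <timestamp>2010-01-03T00:00:00Z</timestamp>
    <type>delete</type>
    <action>delete</action>
    <logtitle>Other page</logtitle>
  </logitem>
</mediawiki>"""


def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def protection_events(stream):
    """Yield (timestamp, action, title) for each protection log item."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "logitem":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        if fields.get("type") == "protect":
            yield fields.get("timestamp"), fields.get("action"), fields.get("logtitle")
        elem.clear()  # drop this item's children once it has been handled


for event in protection_events(io.BytesIO(SAMPLE)):
    print(event)
```

To try this on a real dump, `gzip.open(path, "rb")` from the standard
library could replace the `io.BytesIO` wrapper.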

4. Import Data Into R
=============================

Import all of the data that we've created into a series of RData
files. You'll need to first edit the file `04-import_data.R` so that
the input files (all defined at the very top of the script) match the
files that you've downloaded and created.

Once you've done that, you can run the R script, which will load and
process these:

$ R --no-save < 04-import_data.R

This will output an RData file called `processed_log_and_sql_data.RData`.

5. Generate Spells
=============================

Finally, we run the command that reads in all the prepared datasets and
generates the spells dataset:

$ R --no-save < 05-generate_spells.R

This will generate the final page protection dataset: `spells.RData`
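
For readers unfamiliar with the term, a "spell" is one continuous
period during which a page's protection status is unchanged. The
Python sketch below shows how events for one page could be turned into
spells. It is a simplified illustration and not the logic of
`05-generate_spells.R`: the helper `build_spells` and the action names
are assumptions, and the sketch ignores expiration times and the
database snapshot, both of which a real implementation would need.

```python
from typing import Iterable, List, Optional, Tuple


def build_spells(events: Iterable[Tuple[str, str]]) -> List[dict]:
    """Turn (timestamp, action) pairs for one page, sorted by time, into spells."""
    spells = []
    start: Optional[str] = None  # start of the spell that is currently open
    for timestamp, action in events:
        # A change or removal of protection closes the open spell.
        if action in ("modify", "unprotect") and start is not None:
            spells.append({"start": start, "end": timestamp, "right_censored": False})
            start = None
        # A new or changed protection opens a spell if none is open.
        if action in ("protect", "modify") and start is None:
            start = timestamp
    # A spell that is still open when the data ends is right censored.
    if start is not None:
        spells.append({"start": start, "end": None, "right_censored": True})
    return spells


# Made-up events for a single page.
events = [
    ("2010-01-01", "protect"),
    ("2010-02-01", "modify"),
    ("2010-03-01", "unprotect"),
    ("2011-01-01", "protect"),
]
for spell in build_spells(events):
    print(spell)
```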