MediaWiki Redirect Tools
=======================================================================

  | **Authors:** `Benjamin Mako Hill`__ <mako@atdot.cc> and `Aaron Shaw`__ <aaron.d.shaw@gmail.com>
  | **Homepage:** http://communitydata.cc/wiki-redirects/
  | **Code:** http://projects.mako.cc/source/?p=redirect-tools
  | **License:** `GNU GPLv3 or any later version`__
  | **Description:** Tools to generate a redirect spells dataset from "raw" MediaWiki XML dumps like those published by the Wikimedia Foundation.
  | **Archival copy:** http://dx.doi.org/10.7910/DVN/NQSHQD

__ http://mako.cc/
__ http://aaronshaw.org/
__ http://www.gnu.org/copyleft/gpl.html

If you use this software for research, please cite the following
paper:

  Hill, Benjamin Mako & Shaw, Aaron. (2014) "Consider the Redirect: A
  Missing Dimension of Wikipedia Research." In *Proceedings of the 10th
  International Symposium on Open Collaboration (OpenSym 2014)*. ACM
  Press. `doi: 10.1145/2641580.2641616`__

__ https://doi.org/10.1145/2641580.2641616

**Overview:**

To use these tools, you will need to start with a MediaWiki dump
file. For Wikimedia Foundation projects, you can download them all from:
http://dumps.wikimedia.org/

Wikis hosted on Wikia.com and other MediaWiki-based projects all use the same
XML format for their dumps.

In the examples in this README, I will use a dump of `Simple English
Wikipedia`__ that I downloaded with the following command::

  wget http://dumps.wikimedia.org/simplewiki/20140410/simplewiki-20140410-pages-meta-history.xml.7z

__ https://simple.wikipedia.org/

Before you start, you may also want to change the default directories for writing intermediate output files. The default directories for reading and writing files are set at the top of the file `redirect_tools.R` and can be changed by editing that file. By default, all files will be written to the subdirectory "./output" of the current directory. If you want to use the default directories, you will still need to create them with a command like this::

  mkdir output output/redir output/spells

Step 1: Find Redirects in Revisions
-----------------------------------------

Dependencies:

- Python 2.7
- Wikimedia Utilities (https://bitbucket.org/halfak/wikimedia-utilities)

Input:

- Wikimedia XML Dump files (compressed in some form)

Output:

- bzip2 compressed TSV files (one line per revision)

You will run the `01-extract_redirects.py` script to build a dataset of revisions (edits) that marks every revision as either containing a redirect or not. `01-extract_redirects.py` takes a MediaWiki dump file on STDIN and outputs a TSV file on STDOUT of the following form:

+---------+-------------+--------------------------------+------------+---------+----------+--------------------+
| page.id | revision.id | page.title                     | timestamp  | deleted | redirect | target             |
+=========+=============+================================+============+=========+==========+====================+
| 1935456 | 17563584    | Mikhail Alekseevich Lavrentiev | 1116962833 | FALSE   | FALSE    | NA                 |
+---------+-------------+--------------------------------+------------+---------+----------+--------------------+
| 1935456 | 22034930    | Mikhail Alekseevich Lavrentiev | 1125245577 | FALSE   | TRUE     | Mikhail Lavrentyev |
+---------+-------------+--------------------------------+------------+---------+----------+--------------------+

In this example, the first revision of the article "Mikhail Alekseevich Lavrentiev" was not a redirect, but the second is a redirect to "Mikhail Lavrentyev".
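
As a rough illustration of how output in this form might be consumed, the following Python 3 sketch tallies redirect and non-redirect revisions per page. It is not part of these tools: the function names are hypothetical, and it assumes that the file begins with a header row using the column names shown in the table above::

  import bz2
  import csv
  from collections import Counter


  def count_redirects(lines):
      """Tally the values of the `redirect` column for each page title.

      `lines` is any iterable of TSV text lines. Assumption: the first
      line is a header with the column names from the table above.
      """
      counts = {}
      # QUOTE_NONE keeps titles that contain quote characters intact.
      reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
      for row in reader:
          per_page = counts.setdefault(row["page.title"], Counter())
          per_page[row["redirect"]] += 1
      return counts


  def count_redirects_in_file(path):
      # Text mode ("rt") decompresses and decodes the file on the fly.
      with bz2.open(path, "rt", encoding="utf-8") as handle:
          return count_redirects(handle)

If the file has no header row, the column names would need to be passed to `csv.DictReader` through its `fieldnames` argument instead.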

If you are using the Simple English dump (which is a single file), you would run the following command to uncompress the dump, parse it using our script, compress the output again, and save the output to the default destination::

  7za x -so simplewiki-20140410-pages-meta-history.xml.7z |
  python2.7 01-extract_redirects.py | bzip2 -c - > output/redir/simple_redirs.tsz.bz2

Because our dump file is 7z compressed, I used `7za` to uncompress it. If I had used a gzip or bzip2 compressed file, I would use `zcat` or `bzcat` instead. I'm also piping the output to `bzip2 -c`, which will compress the TSV output to conserve space. The next step assumes a bzip2 compressed file. If you don't want to use bzip2 to compress, you'll need to modify the code.
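
The choice of decompression tool described above can be sketched in Python 3 as a lookup on the file extension. The helper `decompress_command` is hypothetical and not part of these tools; it only covers the three formats mentioned in this README::

  def decompress_command(filename):
      """Return an argument list that writes the decompressed dump to STDOUT."""
      if filename.endswith(".7z"):
          return ["7za", "x", "-so", filename]
      if filename.endswith(".gz"):
          return ["zcat", filename]
      if filename.endswith(".bz2"):
          return ["bzcat", filename]
      raise ValueError("unrecognized compression format: " + filename)

The returned list could be handed to `subprocess.Popen` with `stdout=subprocess.PIPE` so that its output feeds the extraction script.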


Step 2: Generate spells
-----------------------------------------

Dependencies:

- GNU R
- data.table (http://cran.r-project.org/web/packages/data.table/)
- foreign (http://cran.r-project.org/web/packages/foreign/)

Input:

- bzip2 compressed TSV files

Output:

- RData files containing a data.frame of redirect spells named `redirect.spell`
  (one file per input file)
- Stata DTA file (same data)
- TSV file (same data)

The file `redirect_tools.R` contains an R function `generate.spells()` that
takes a data frame of edit data as created in step 1 and a list of page titles
in order to create a list of redirect spells for those pages. It also
contains a function `filename.to.spells()` which takes the filename of a bzip2
compressed file of the form created in step 1 and outputs a full list of
redirect spells.
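
To make the idea of a spell concrete, here is a minimal Python 3 sketch. It is not a port of `generate.spells()` and may differ from what that function does. It assumes that a spell is a maximal run of consecutive revisions of one page that redirect to the same target, and the function name and tuple layout are hypothetical::

  def redirect_spells(revisions):
      """Collapse one page's revisions into redirect spells.

      `revisions` is a list of (timestamp, is_redirect, target) tuples
      for a single page. Returns (target, start, end) tuples, where
      `end` is the timestamp of the revision that ended the spell, or
      None if the page was still a redirect at its last revision.
      """
      spells = []
      current = None  # (target, start) of the spell that is open, if any
      for timestamp, is_redirect, target in sorted(revisions, key=lambda r: r[0]):
          if current is not None and (not is_redirect or target != current[0]):
              # The open spell ends when the page stops redirecting
              # or starts pointing at a different target.
              spells.append((current[0], current[1], timestamp))
              current = None
          if is_redirect and current is None:
              current = (target, timestamp)
      if current is not None:
          spells.append((current[0], current[1], None))
      return spells

Applied to the two example revisions from step 1, this sketch yields a single open-ended spell that starts at the second revision and targets "Mikhail Lavrentyev".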

You can run the script with::

  R --no-save < 02-generate_spells.R

By default, output will be saved into `output/spells`. The script will
save three versions of the output:

1. `redirect_spells.RData`: An RData file suitable for use in GNU R.
2. `redirect_spells.tsv`: A tab-separated values file suitable for use in a variety of different programs.
3. `redirect_spells.dta`: A DTA file suitable for use in Stata (many versions will crop very long article titles due to limitations in the DTA format).

Running Code in Parallel
-----------------------------------------

Because the full history dumps from the Wikimedia Foundation are split into
many files, it is usually appropriate to parse these dumps in
parallel. Although the specific ways you choose to do this will vary
by the queuing or scheduling system you use, we've included examples
of the scripts we used with Condor on the Harvard/MIT Data Center
(HMDC) in the `examples/` directory of the source code. They will not
work without modification for your computing environment because they
have configuration options and paths for our environment
hardcoded. That said, they may give you an idea of where you might
want to start.
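
For readers without a scheduler such as Condor, the following Python 3 sketch shows one way to run step 1 over several dump files on a single multi-core machine. It is not one of the scripts in `examples/`. The function names and the output naming scheme are hypothetical, and it assumes 7z compressed dumps with `7za`, `python2.7`, and `bzip2` available on the PATH::

  import os
  import shlex
  import subprocess
  from concurrent.futures import ThreadPoolExecutor


  def pipeline_for(dump_path, out_dir="output/redir"):
      """Build the step 1 shell pipeline for one 7z compressed dump file."""
      base = os.path.basename(dump_path)
      # Hypothetical naming scheme: one output file per input file.
      out_path = os.path.join(out_dir, base + ".tsv.bz2")
      template = "7za x -so {} | python2.7 01-extract_redirects.py | bzip2 -c - > {}"
      return template.format(shlex.quote(dump_path), shlex.quote(out_path))


  def run_all(dump_paths, workers=4):
      """Run one pipeline per dump file, at most `workers` at a time."""
      def run(path):
          # Returns the exit status of the shell pipeline.
          return subprocess.call(pipeline_for(path), shell=True)
      # Threads suffice here: each one only waits on an external process.
      with ThreadPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(run, dump_paths))

Calling `run_all()` with a list of dump paths returns one exit status per file, in the same order as the input.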

In this parallel code there is a third file,
`03-assemble_redirect_spells.R`, that contains R code that will read in
all of the separate RData files created in parallel processing,
assemble the many smaller data frames into a single data frame, and then
save that unified data frame into a single RData file.
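
The assembly step itself is done in R on RData files. The same idea, applied to TSV tables that share a header row, can be sketched in Python 3 as follows. This is an analogue for illustration, not a port of `03-assemble_redirect_spells.R`, and the function name is hypothetical::

  def assemble_tables(tables):
      """Concatenate tables (lists of TSV lines) that share one header.

      Keeps the header from the first non-empty table and raises
      ValueError if a later table's header differs.
      """
      combined = []
      header = None
      for lines in tables:
          if not lines:
              continue  # skip empty tables
          if header is None:
              header = lines[0]
              combined.append(header)
          elif lines[0] != header:
              raise ValueError("header mismatch: " + repr(lines[0]))
          combined.extend(lines[1:])
      return combined

Checking the header of every table guards against silently combining files that were produced with different column layouts.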