MediaWiki Redirect Tools
=======================================================================

  | **Author:** `Benjamin Mako Hill`__ <mako@atdot.cc>
  | **Homepage:** http://networkcollectiv.es/wiki-redirects/
  | **License:** `GNU GPLv3 or any later version`__ (see COPYING)
  | **Description:** Tools to generate a redirect spells dataset from "raw" MediaWiki XML dumps like those published by the Wikimedia Foundation.

__ http://mako.cc/
__ http://www.gnu.org/copyleft/gpl.html

If you use this software for research, please **cite the following
paper** in any resulting publication:

  *Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
  Dimension of Wikipedia Research." In Proceedings of the 10th International
  Symposium on Open Collaboration (OpenSym 2014). ACM Press, 2014.*

**Overview:**

To use these tools, you will need to start with a MediaWiki dump
file. For Wikimedia Foundation projects, you can download them all from:
http://dumps.wikimedia.org/

Wikis from Wikia.com and other MediaWiki-based projects all use the same
XML format for their dumps.

In the examples in this README, I will use a dump of `Simple English
Wikipedia`__ that I downloaded with the following command::

  wget http://dumps.wikimedia.org/simplewiki/20140410/simplewiki-20140410-pages-meta-history.xml.7z

__ https://simple.wikipedia.org/

Before you start, you may also want to change the default directories
for writing intermediate output files.  The default directories for
writing and reading files are set at the top of the file `redirect_tools.R`
and can be changed by editing that file. By default, all files will be
written to the subdirectory "./output" in the local directory. If you
want to use the default directories, you will still need to create them
with a command like this::

  mkdir -p output/redir output/spells

Step 1: Find Redirects in Revisions
=======================================================================

Dependencies:

- Python 2.7
- Wikimedia Utilities (https://bitbucket.org/halfak/wikimedia-utilities)

Input:

- MediaWiki XML dump files (compressed in some form)

Output:

- bzip2 compressed TSV files (one line per revision)

Run the file `01-extract_redirects.py` to build a dataset of revisions (edits)
that marks every revision as either containing a redirect or not.

The script `01-extract_redirects.py` takes a MediaWiki dump file on STDIN and
outputs a TSV file on STDOUT of the following form.

+---------+-------------+--------------------------------+------------+---------+----------+--------------------+
| page.id | revision.id | page.title                     | timestamp  | deleted | redirect | target             |
+=========+=============+================================+============+=========+==========+====================+
| 1935456 | 17563584    | Mikhail Alekseevich Lavrentiev | 1116962833 | FALSE   | FALSE    | NA                 |
+---------+-------------+--------------------------------+------------+---------+----------+--------------------+
| 1935456 | 22034930    | Mikhail Alekseevich Lavrentiev | 1125245577 | FALSE   | TRUE     | Mikhail Lavrentyev |
+---------+-------------+--------------------------------+------------+---------+----------+--------------------+

In this case, the first revision of the article "Mikhail Alekseevich
Lavrentiev" was not a redirect but the second is a redirect to "Mikhail
Lavrentyev".

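To make the column layout concrete, here is a minimal Python 3 sketch that
parses rows of this form. It is an illustration only and is not part of these
tools. The function name `parse_revisions` is hypothetical, and the sketch
assumes that `timestamp` always holds an integer (as in the example rows), that
`TRUE`/`FALSE` are the only flag values, and that `NA` marks a missing target.
It skips a header row if one is present, since this README does not say
whether the script writes one::

  import csv
  import io

  # Column names as shown in the table above.
  COLUMNS = ["page.id", "revision.id", "page.title", "timestamp",
             "deleted", "redirect", "target"]


  def parse_revisions(lines):
      """Yield one dict per revision from tab-separated lines."""
      reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t",
                              quoting=csv.QUOTE_NONE)
      for row in reader:
          if row["page.id"] == "page.id":
              continue  # skip a header row, if the file has one
          row["timestamp"] = int(row["timestamp"])
          row["deleted"] = row["deleted"] == "TRUE"
          row["redirect"] = row["redirect"] == "TRUE"
          if row["target"] == "NA" and not row["redirect"]:
              row["target"] = None  # assumption: "NA" means "no target"
          yield row


  # The two example rows from the table above.
  sample = io.StringIO(
      "1935456\t17563584\tMikhail Alekseevich Lavrentiev\t1116962833\tFALSE\tFALSE\tNA\n"
      "1935456\t22034930\tMikhail Alekseevich Lavrentiev\t1125245577\tFALSE\tTRUE\tMikhail Lavrentyev\n"
  )
  revisions = list(parse_revisions(sample))
  redirects = [r for r in revisions if r["redirect"]]
  print(len(revisions), len(redirects), redirects[0]["target"])

For the two example rows, this prints `2 1 Mikhail Lavrentyev`.
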
For the Simple English dump (which is a single file), I would run the
following command to send the output to the default directory::

  7za x -so simplewiki-20140410-pages-meta-history.xml.7z |
  python2.7 01-extract_redirects.py | bzip2 -c - > output/redir/simple_redirs.tsz.bz2

Because our dump file is 7z compressed, I used `7za` to uncompress it. If I had
used a gzip or bzip2 compressed file, I would use `zcat` or `bzcat` instead. I'm
also piping the output to `bzip2 -c`, which will bzip2 the TSV output to
conserve space. The next step assumes a bzip2 compressed file. If you don't
want to bzip2 compress, you'll need to modify the code.

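If you script this step, the choice of decompressor described above can be
made from the file extension. The following Python 3 sketch shows one way to
do that. The helper name `decompress_command` and the extension table are
assumptions for illustration, and only the three programs named in this README
are covered::

  import os

  # Decompression commands named in this README, keyed by file extension.
  # Each one writes the uncompressed dump to standard output.
  DECOMPRESSORS = {
      ".7z": ["7za", "x", "-so"],
      ".gz": ["zcat"],
      ".bz2": ["bzcat"],
  }


  def decompress_command(dump_path):
      """Return the argument list that streams dump_path to STDOUT."""
      extension = os.path.splitext(dump_path)[1].lower()
      try:
          return DECOMPRESSORS[extension] + [dump_path]
      except KeyError:
          raise ValueError("unrecognized compression: %r" % dump_path)


  print(decompress_command("simplewiki-20140410-pages-meta-history.xml.7z"))

The returned list can be passed to `subprocess.Popen` as the first stage of the
pipeline shown above.
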
Step 2: Generate Spells
=======================================================================

Dependencies:

- GNU R
- data.table (http://cran.r-project.org/web/packages/data.table/)
- foreign (http://cran.r-project.org/web/packages/foreign/)

Input:

- bzip2 compressed TSV files

Output:

- RData files containing a data.frame of redirect spells named `redirect.spell`
  (one file per input file)
- Stata DTA file (same data)
- TSV file (same data)

The file `redirect_tools.R` contains an R function `generate.spells()` that
takes a data frame of edit data as created in step 1 and a list of page titles
and creates a list of redirect spells for those pages.  It also
contains a function `filename.to.spells()` which takes the filename of a bzip2
compressed file of the form created in step 1 and outputs a full list of
redirect spells.

You can run this step with::

  R --no-save < 02-generate_spells.R

By default, output will be saved into `output/spells`.

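To give a rough idea of what a redirect spell is, here is a minimal Python 3
sketch of the concept. It is not a port of `generate.spells()`, and the R code
may define spells differently. The sketch assumes that a spell is a maximal
run of consecutive revisions of one page that redirect to the same target,
that it starts at the timestamp of the first revision in the run, and that it
ends at the timestamp of the revision that breaks the run. The function name
`redirect_spells` and the output keys are hypothetical::

  def redirect_spells(revisions):
      """Collapse one page's revisions into redirect spells.

      revisions: iterable of (timestamp, is_redirect, target) tuples.
      Returns a list of dicts with keys "target", "start", and "end";
      "end" is None for a spell still open at the last revision.
      """
      spells = []
      current = None  # the spell that is open right now, if any
      for timestamp, is_redirect, target in sorted(revisions,
                                                   key=lambda r: r[0]):
          if current is not None and (not is_redirect
                                      or target != current["target"]):
              current["end"] = timestamp  # this revision closes the spell
              current = None
          if is_redirect and current is None:
              current = {"target": target, "start": timestamp, "end": None}
              spells.append(current)
      return spells


  # The two example revisions from step 1.
  history = [
      (1116962833, False, None),
      (1125245577, True, "Mikhail Lavrentyev"),
  ]
  print(redirect_spells(history))

For the example page, this yields a single open spell that targets "Mikhail
Lavrentyev" and starts at 1125245577.
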
Running Code in Parallel
=======================================================================

Because the full history dumps from the Wikimedia Foundation are split into
many files, it can be appropriate to parse these dumps in parallel. Although
the specific ways you choose to do this will vary by the queuing system you use,
we've included examples of the scripts we used with Condor on the Harvard/MIT
Data Center (HMDC) in the "examples/" directory. They will not work without
modification for your computing environment because they have our environment
hardcoded in, but they will give you an idea of where you might want to start.

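If you have a single multi-core machine rather than a queuing system, a small
thread pool is one simple alternative. The Python 3 sketch below is not the
Condor setup from "examples/". It only wraps the step 1 pipeline shown earlier
and assumes 7z-compressed input. The function names, the `workers` default, and
the `runner` parameter (there so the pipelines can be stubbed out while
testing) are all hypothetical::

  import shlex
  import subprocess
  from concurrent.futures import ThreadPoolExecutor


  def step1_command(dump_path, out_path):
      """Build the step 1 shell pipeline for one 7z-compressed dump file."""
      return ("7za x -so {dump} | python2.7 01-extract_redirects.py "
              "| bzip2 -c - > {out}").format(dump=shlex.quote(dump_path),
                                             out=shlex.quote(out_path))


  def run_step1(jobs, workers=4, runner=subprocess.call):
      """Run one pipeline per (dump_path, out_path) pair, `workers` at a time.

      Returns the shell's exit status for each pipeline, in job order.
      """
      commands = [step1_command(dump, out) for dump, out in jobs]
      with ThreadPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(lambda cmd: runner(cmd, shell=True), commands))


  print(step1_command("simplewiki-20140410-pages-meta-history.xml.7z",
                      "output/redir/simple_redirs.tsz.bz2"))

Threads are enough here because the heavy work happens in the child processes.
Calling `run_step1()` with a list of (dump file, output file) pairs starts the
pipelines.
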
Additionally, there is a third step, `03-assemble_redirect_spells.R`, which
contains R code that reads in all of the separate RData files, assembles
the many smaller data frames into a single data.frame, and then saves that
unified data.frame into a single RData file.