Step 1: Flag Redirects in Revisions
====================================

Dependencies:

- Python 2.7
- Wikimedia Utilities (https://bitbucket.org/halfak/wikimedia-utilities)

Input: 7z compressed Wikimedia XML Dump files

Output: bzip compressed TSV files (one file per input file; one line per revision)

Run the file `01-extract_redirects.py` to build a dataset of revisions (edits) that marks every revision as either containing a redirect or not.

The script `01-extract_redirects.py` takes a MediaWiki dump file on STDIN and outputs a TSV file on STDOUT of the following form:

> page.id revision.id page.title timestamp deleted redirect target
> 1935456 17563584 Mikhail Alekseevich Lavrentiev 1116962833 FALSE FALSE NA
> 1935456 22034930 Mikhail Alekseevich Lavrentiev 1125245577 FALSE TRUE Mikhail Lavrentyev

In this case, the first revision of the article "Mikhail Alekseevich Lavrentiev" was not a redirect, but the second is a redirect to "Mikhail Lavrentyev".

Because the full history dumps from the Wikimedia Foundation (WMF) are split into many files, it can be appropriate to parse these dumps in parallel. Although the specific way you choose to do this will vary by system, we've included examples of the scripts we used with Condor on the Harvard/MIT Data Center (HMDC) in the "examples/" directory. They will not work without modification in your computing environment, but they will give you an idea of where you might want to start.

Step 2: Generate spells
====================================

Dependencies:

- GNU R
- data.table (http://cran.r-project.org/web/packages/data.table/)

Input: bzip compressed TSV files

Output: RData files containing data.frame of redirect spells named `redirect.spell` (one file per input file)

The file `02-generate_spells.R` contains an R function `generate.spells()` that takes a data frame of edit data as created in step 1 and a list of page titles, and creates a list of redirect spells for those pages.
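
To make the idea of a redirect spell concrete, here is a minimal Python sketch. It is not the code in `02-generate_spells.R`. It assumes that the Step 1 output is tab-separated with the header row shown in the sample, and that a spell is a run of consecutive redirect revisions of one page pointing at the same target, starting at the first such revision and ending at the next revision that breaks the run. The function names `read_revisions()` and `redirect_spells()` and the `start`/`end` fields are hypothetical. The sketch is written for Python 3.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def read_revisions(tsv_text):
    """Parse tab-separated Step 1 output (header row assumed) into dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["timestamp"] = int(row["timestamp"])
        row["redirect"] = row["redirect"] == "TRUE"
        rows.append(row)
    return rows


def redirect_spells(rows):
    """Return one dict per spell, using the assumed definition above."""
    spells = []
    rows = sorted(rows, key=itemgetter("page.id", "timestamp"))
    for page_id, revs in groupby(rows, key=itemgetter("page.id")):
        current = None  # the spell that is open right now, if any
        for rev in revs:
            if (current is not None and rev["redirect"]
                    and rev["target"] == current["target"]):
                continue  # same redirect as before: the spell carries on
            if current is not None:
                # This revision breaks the run, so it closes the open spell.
                current["end"] = rev["timestamp"]
                spells.append(current)
                current = None
            if rev["redirect"]:
                current = {"page.id": page_id,
                           "page.title": rev["page.title"],
                           "target": rev["target"],
                           "start": rev["timestamp"],
                           "end": None}
        if current is not None:
            spells.append(current)  # still a redirect at the last revision
    return spells


# The two sample rows from Step 1, with tabs between the columns.
sample = (
    "page.id\trevision.id\tpage.title\ttimestamp\tdeleted\tredirect\ttarget\n"
    "1935456\t17563584\tMikhail Alekseevich Lavrentiev\t1116962833\tFALSE\tFALSE\tNA\n"
    "1935456\t22034930\tMikhail Alekseevich Lavrentiev\t1125245577\tFALSE\tTRUE\tMikhail Lavrentyev\n"
)
spells = redirect_spells(read_revisions(sample))
```

For the sample rows, this yields a single open-ended spell that starts at timestamp 1125245577 and points to "Mikhail Lavrentyev".
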
The file also contains a function `filename.to.spells()`, which takes the filename of a bzip compressed file of the form created in step 1 and outputs a full list of redirect spells.

In its current form, the R code is designed to be run on the HMDC cluster with Condor, using the scripts prefixed with "02" in the examples directory. These scripts can be modified to work in different configurations.

Step 3: Assemble Redirects Data
====================================

Dependencies:

- GNU R

Input: RData files containing data.frame of redirect spells named `redirect.spell`

Output: A combined RData file that contains all redirect spells

The file `03-assemble_redirect_spells.R` contains R code that reads in all of the separate RData files, assembles the many smaller data.frames into a single data.frame, and then saves that unified data.frame into a single RData file.
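
The assembly step is a plain concatenation of per-file tables. The Python 3 sketch below illustrates that pattern outside of R. It assumes the per-file spells have been exported as bzip compressed TSV files with identical header rows, which is not what `03-assemble_redirect_spells.R` reads (it works on RData files). The function name `assemble_tsv()` is hypothetical.

```python
import bz2


def assemble_tsv(paths, out_path):
    """Concatenate bzip2-compressed TSV files into one compressed TSV.

    The header row is written once; every later file must repeat it exactly.
    Returns the number of data rows written.
    """
    header = None
    n_rows = 0
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for path in sorted(paths):
            with bz2.open(path, "rt", encoding="utf-8") as part:
                first = part.readline()
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError("column mismatch in %s" % path)
                for line in part:
                    out.write(line)
                    n_rows += 1
    return n_rows
```

Checking each header against the first one is a cheap guard against silently stacking tables whose columns differ. The output path should not be one of the input paths.
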