Step 1: Flag Redirects in Revisions
====================================

- Wikimedia Utilities (https://bitbucket.org/halfak/wikimedia-utilities)

Input: 7z compressed Wikimedia XML Dump files
Output: bzip compressed TSV files (one file per input file; one line per revision)

Run the file `01-extract_redirects.py` to build a dataset of revisions or edits
that marks every revision as either containing a redirect or not.

The script `01-extract_redirects.py` takes a MediaWiki dump file on STDIN and
outputs a TSV file on STDOUT of the following form:

> page.id	revision.id	page.title	timestamp	deleted	redirect	target
> 1935456	17563584	Mikhail Alekseevich Lavrentiev	1116962833	FALSE	FALSE	NA
> 1935456	22034930	Mikhail Alekseevich Lavrentiev	1125245577	FALSE	TRUE	Mikhail Lavrentyev

In this case, the first revision of the article "Mikhail Alekseevich
Lavrentiev" was not a redirect, but the second is a redirect to "Mikhail
Lavrentyev".
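
To make the column layout concrete, here is a minimal Python sketch of reading one of these output files back in. It is not part of this repository: the function name `read_revisions` is hypothetical, and the sketch assumes the file is UTF-8, tab-separated, unquoted, and begins with the header line shown above.

```python
import bz2
import csv


def read_revisions(path):
    """Yield one dict per revision from a bzip2-compressed TSV file.

    Assumes (not confirmed by the script itself) that the first line is the
    header "page.id ... target" and that fields are separated by tabs.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Flags are written as the strings TRUE / FALSE.
            row["deleted"] = row["deleted"] == "TRUE"
            row["redirect"] = row["redirect"] == "TRUE"
            # NA marks "no target"; a page literally titled "NA" would be
            # indistinguishable here, which this sketch does not handle.
            row["target"] = None if row["target"] == "NA" else row["target"]
            yield row
```

Applied to the two sample lines above, the second row would come back with `redirect` set to `True` and `target` set to "Mikhail Lavrentyev".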

Because the full history dumps from the Wikimedia Foundation (WMF) are split
into many files, it can be appropriate to parse these dumps in parallel.
Although the specific way you choose to do this will vary by system, we've
included examples of the scripts we used with Condor on the Harvard/MIT Data
Center (HMDC) in the "examples/" directory. They will not work without
modification for your computing environment, but they will give you an idea of
where you
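
If you do not have a cluster scheduler, the same idea can be sketched on a single machine. The Python below is only an illustration, not one of the scripts in "examples/": the function names `build_command` and `run_all`, the output naming scheme, the `python` interpreter name, and the use of `7z x -so` to stream an archive to STDOUT are all assumptions you would need to adapt.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(dump_path, output_dir, script="01-extract_redirects.py"):
    """Return a shell pipeline: decompress -> flag redirects -> bzip2."""
    dump_path = Path(dump_path)
    # Hypothetical naming scheme: <dump name minus ".7z">.tsv.bz2
    out_path = Path(output_dir) / (dump_path.stem + ".tsv.bz2")
    return "7z x -so {dump} | python {script} | bzip2 -c > {out}".format(
        dump=shlex.quote(str(dump_path)),
        script=shlex.quote(script),
        out=shlex.quote(str(out_path)),
    )


def run_all(dump_paths, output_dir, workers=4):
    """Run one pipeline per dump file, at most `workers` at a time."""
    commands = [build_command(p, output_dir) for p in dump_paths]
    # Threads are enough here: each one only waits on external processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each return code is the exit status of the last command in the
        # pipeline (bzip2), so an earlier failure may go unnoticed.
        return list(
            pool.map(lambda c: subprocess.run(c, shell=True).returncode, commands)
        )
```

A call such as `run_all(sorted(Path("dumps").glob("*.7z")), "out")` would then produce one TSV file per dump file, assuming both directories exist.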

Step 2: Generate spells
====================================

- data.table (http://cran.r-project.org/web/packages/data.table/)

Input: bzip compressed TSV files
Output: RData files containing data.frame of redirect spells named
`redirect.spell` (one file per input file)

The file `02-generate_spells.R` contains an R function `generate.spells()` that
takes a data frame of edit data as created in Step 1 and a list of page titles,
and creates a list of redirect spells for those pages.

It also contains a function `filename.to.spells()`, which takes the filename of
a bzip compressed file of the form created in Step 1 and outputs a full list of

In its current form, the R code is designed to be run on the HMDC cluster with
Condor, using the scripts prefixed with "02" in the examples directory. These
scripts can be modified to work in different configurations.
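
For readers who want a feel for what a redirect spell is before reading the R code, here is a small Python sketch. It is not a port of `generate.spells()`. It assumes that a spell is a run of consecutive revisions of one page that all redirect to the same target, starting at the first such revision and ending at the next revision that redirects elsewhere or not at all. The function name `generate_spells` and the `start`/`end` keys are hypothetical.

```python
def generate_spells(revisions):
    """Collapse one page's revisions into redirect spells.

    `revisions` is an iterable of dicts with the keys "timestamp" (int),
    "redirect" (bool) and "target" (str or None). The definition of a
    spell used here is an assumption, not taken from the R code.
    """
    ordered = sorted(revisions, key=lambda r: r["timestamp"])
    spells = []
    current = None
    for rev in ordered:
        target = rev["target"] if rev["redirect"] else None
        if current is not None and target != current["target"]:
            # The redirect changed or went away: close the open spell.
            current["end"] = rev["timestamp"]
            spells.append(current)
            current = None
        if current is None and target is not None:
            # A new redirect begins with this revision.
            current = {"target": target, "start": rev["timestamp"], "end": None}
    if current is not None:
        # Still a redirect at the last revision seen, so the spell stays open.
        spells.append(current)
    return spells
```

For the two sample revisions shown in Step 1, this would yield a single open spell that starts at timestamp 1125245577 and targets "Mikhail Lavrentyev".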

Step 3: Assemble Redirects Data
====================================

Input: RData files containing data.frame of redirect spells named
Output: A combined RData file that contains all redirect spells

The file `03-assemble_redirect_spells.R` contains R code that reads in all
of the separate RData files, assembles the many smaller data frames into a
single data.frame, and then saves that unified data.frame into a single RData