From e861323a25caf19a4dbda73b69f6400b5f183da3 Mon Sep 17 00:00:00 2001
From: Erik Garrison
Date: Sun, 13 Mar 2011 20:49:54 -0400
Subject: [PATCH] updated README

---
 README    | 35 ++++++++++++++++++++++++++++++++++-
 wikiq.cpp |  1 -
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/README b/README
index ee39e87..f4bae6a 100644
--- a/README
+++ b/README
@@ -2,12 +2,45 @@ wikiq: a WikiMedia XML data dump to .tsv parser
 
 author: Erik Garrison
 
+
+overview:
+
 wikiq is written in C using expat. It is designed to enable researchers to
 rapidly extract revision histories (minus text and comments) from impossibly
 large XML datasets.
 
-To use, first make sure you have libexpat installed, then:
+
+use:
+
+To use, first make sure you have libexpat and libpcrecpp installed, then:
 
 % make
 % ./wikiq -h # prints usage
 % 7za e -so hugewikidatadump.xml | ./wikiq >hugewikidatadump.tsv
+
+
+features:
+
+In addition to parsing WikiMedia XML data dumps into a tab-separated tabular
+format, wikiq extracts article diffs and can execute arbitrary Perl-compatible
+regular expressions against the additions and deletions which differentiate any
+revision from the previous. Any number of regular expressions may be supplied
+on the command line, and may be tagged using the '-n' option.
+
+MD5 checksums are used at runtime for precise detection of reversions.
+
+
+output:
+
+wikiq generates these fields for each revision:
+
+title, articleid, revid, timestamp, anon, editor, editorid, minor,
+text_length, text_entropy, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against add/delete diffs
+
+Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
+unless the article is a revert to a previous revision, in which case, it
+contains the revision ID of the revision which was reverted to.
+
+
+author: Erik Garrison
diff --git a/wikiq.cpp b/wikiq.cpp
index 6635dc0..bd895d4 100644
--- a/wikiq.cpp
+++ b/wikiq.cpp
@@ -498,7 +498,6 @@ void print_usage(char* argv[]) {
     << "text_length, text_entropy, text_md5, reversion, additions_size, deletions_size" << endl
     << ".... and additional fields for each regex executed against add/delete diffs" << endl
     << endl
-    << "Notes:" << endl
     << "Boolean fields are TRUE/FALSE except in the case of reversion, which is blank" << endl
     << "unless the article is a revert to a previous revision, in which case, it" << endl
     << "contains the revision ID of the revision which was reverted to." << endl
-- 
2.30.2