X-Git-Url: https://projects.mako.cc/source/wikiq/blobdiff_plain/b685a77fb13d526f5b4b216844a216b0dbab1ed3..b709f85fedff91f10850903a0ef609792ca65f51:/README

diff --git a/README b/README
index ee39e87..e5b019c 100644
--- a/README
+++ b/README
@@ -1,13 +1,45 @@
-wikiq: a WikiMedia XML data dump to .tsv parser
+wikiq: a simple and fast stream-based MediaWiki XML dump parser
 
-author: Erik Garrison
+authors: Erik Garrison
+         Benjamin Mako Hill
 
-wikiq is written in C using expat. It is designed to enable researchers to
-rapidly extract revision histories (minus text and comments) from impossibly
-large XML datasets.
+overview:
 
-To use, first make sure you have libexpat installed, then:
+wikiq is written in C++ using expat. It is designed to enable
+researchers to rapidly extract revision histories (minus text and
+comments) from large XML datasets.
+
+use:
+
+To use, first make sure you have libexpat and libpcrecpp installed (e.g.
+via packages libexpat1 and libpcre3-dev on Ubuntu), then:
 
 % make
 % ./wikiq -h # prints usage
 % 7za e -so hugewikidatadump.xml | ./wikiq >hugewikidatadump.tsv
+
+
+features:
+
+In addition to parsing WikiMedia XML data dumps into a tab-separated
+tabular format, wikiq can match Perl-compatible regular expressions
+against revision content, can extract article diffs, and can match
+regexes against the additions and deletions between revisions. Any
+number of regular expressions may be supplied on the command line, and
+may be tagged using the '-n' and -N options.
+
+MD5 checksums of revisions are used at runtime.
+
+output:
+
+wikiq generates these fields for each revision:
+
+title, articleid, revid, timestamp, anon, editor, editorid, minor,
+text_length, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against content or
+added/deleted diffs
+
+Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
+unless the article is a revert to a previous revision, in which case, it
+contains the revision ID of the revision which was reverted to.
+
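
A reader consuming the output described in the README's "output:" section might load it along the following lines. This is only a sketch, not part of wikiq: the field order is copied from the list in the README, but the choice of anon and minor as the TRUE/FALSE columns, the absence of a header row, the helper names parse_row and read_wikiq, and every value in SAMPLE (including the timestamp format) are assumptions made for illustration.

```python
import csv
import io

# Field order copied from the README's "output:" list. Whether wikiq also
# emits a header row is not stated there, so none is assumed here.
FIELDS = [
    "title", "articleid", "revid", "timestamp", "anon", "editor",
    "editorid", "minor", "text_length", "text_md5", "reversion",
    "additions_size", "deletions_size",
]
# Assumption: these are the columns that hold TRUE/FALSE.
BOOLEAN_FIELDS = {"anon", "minor"}


def parse_row(cells):
    """Map one tab-separated row onto the README's field names."""
    row = dict(zip(FIELDS, cells))
    for name in BOOLEAN_FIELDS:
        row[name] = row[name] == "TRUE"
    # reversion is blank for ordinary edits, otherwise the revid reverted to.
    row["reversion"] = row["reversion"] or None
    # Per-regex columns, if any, follow the fixed fields; kept as raw strings.
    row["extra"] = list(cells[len(FIELDS):])
    return row


def read_wikiq(stream):
    """Read every non-empty row of wikiq-style TSV from a text stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [parse_row(cells) for cells in reader if cells]


# Made-up rows for illustration only; not real wikiq output.
SAMPLE = (
    "Example\t1\t100\t2010-01-01T00:00:00Z\tFALSE\tAlice\t7\tFALSE\t42\t"
    "d41d8cd98f00b204e9800998ecf8427e\t\t42\t0\n"
    "Example\t1\t102\t2010-01-03T00:00:00Z\tTRUE\t10.0.0.1\t\tFALSE\t42\t"
    "d41d8cd98f00b204e9800998ecf8427e\t100\t0\t5\n"
)

if __name__ == "__main__":
    for row in read_wikiq(io.StringIO(SAMPLE)):
        print(row["revid"], row["anon"], row["reversion"])
```

Mapping a blank reversion cell to None keeps "not a revert" distinct from any real revision ID, which matches the README's note that the field is blank unless the revision is a revert.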