X-Git-Url: https://projects.mako.cc/source/wikiq/blobdiff_plain/95cc4c318cf4e6e46b8ae9e947799eee9d30a8c9..e861323a25caf19a4dbda73b69f6400b5f183da3:/README

diff --git a/README b/README
index ee39e87..f4bae6a 100644
--- a/README
+++ b/README
@@ -2,12 +2,45 @@
 wikiq: a WikiMedia XML data dump to .tsv parser
 
 author: Erik Garrison
+
+overview:
+
 wikiq is written in C using expat. It is designed to enable researchers to
 rapidly extract revision histories (minus text and comments) from impossibly
 large XML datasets.
 
-To use, first make sure you have libexpat installed, then:
+
+use:
+
+To use, first make sure you have libexpat and libpcrecpp installed, then:
 
     % make
     % ./wikiq -h # prints usage
     % 7za e -so hugewikidatadump.xml | ./wikiq >hugewikidatadump.tsv
+
+
+features:
+
+In addition to parsing WikiMedia XML data dumps into a tab-separated tabular
+format, wikiq extracts article diffs and can execute arbitrary Perl-compatible
+regular expressions against the additions and deletions which differentiate any
+revision from the previous. Any number of regular expressions may be supplied
+on the command line, and may be tagged using the '-n' option.
+
+MD5 checksums are used at runtime for precise detection of reversions.
+
+
+output:
+
+wikiq generates these fields for each revision:
+
+title, articleid, revid, timestamp, anon, editor, editorid, minor,
+text_length, text_entropy, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against add/delete diffs
+
+Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
+unless the article is a revert to a previous revision, in which case, it
+contains the revision ID of the revision which was reverted to.
+
+
+author: Erik Garrison
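
note on the reversion field:

The MD5-based reversion detection described in the patch can be illustrated
with a minimal Python sketch. This is not wikiq's C implementation. It assumes
that a revert is any revision whose text has the same MD5 checksum as an
earlier revision of the same article, and that the earliest matching revision
ID is the one reported; the README does not say which match wikiq picks. The
function name detect_reversions and the sample history are hypothetical.

```python
import hashlib


def detect_reversions(revisions):
    """Yield (revid, text_md5, reversion) for one article's revisions.

    `revisions` is an iterable of (revid, text) pairs in chronological order.
    `reversion` is "" unless the text's MD5 matches an earlier revision, in
    which case it is that earlier revision's ID (assumption: the first
    revision seen with that checksum).
    """
    seen = {}  # MD5 hex digest -> revid of the first revision with that text
    for revid, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        reversion = seen.get(digest, "")  # blank when the text is new
        seen.setdefault(digest, revid)    # remember only the first occurrence
        yield revid, digest, reversion


# Hypothetical article history: revision 103 restores the text of 101.
history = [
    (101, "original text"),
    (102, "vandalised text"),
    (103, "original text"),
]
rows = list(detect_reversions(history))
```

Comparing fixed-length checksums rather than full revision texts keeps memory
use per article small, which matters when streaming very large dumps.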