From: Benjamin Mako Hill
Date: Sat, 12 Apr 2014 05:41:44 +0000 (-0700)
Subject: updated README file and documentation
X-Git-Url: https://projects.mako.cc/source/wikiq/commitdiff_plain/32e1edb3eb790387857b5ad6dd2de18187f4367b

updated README file and documentation
---

diff --git a/README b/README
index f4bae6a..4478dca 100644
--- a/README
+++ b/README
@@ -1,14 +1,13 @@
-wikiq: a WikiMedia XML data dump to .tsv parser
-
-author: Erik Garrison
+wikiq: a simple and fast stream-based MediaWiki XML dump parser
 
+authors: Erik Garrison
+ Benjamin Mako Hill
 
 overview:
 
-wikiq is written in C using expat. It is designed to enable researchers to
-rapidly extract revision histories (minus text and comments) from impossibly
-large XML datasets.
-
+wikiq is written in C++ using expat. It is designed to enable
+researchers to rapidly extract revision histories (minus text and
+comments) from large XML datasets.
 
 use:
 
@@ -21,26 +20,25 @@ To use, first make sure you have libexpat and libpcrecpp installed, then:
 
 features:
 
-In addition to parsing WikiMedia XML data dumps into a tab-separated tabular
-format, wikiq extracts article diffs and can execute arbitrary Perl-compatible
-regular expressions against the additions and deletions which differentiate any
-revision from the previous. Any number of regular expressions may be supplied
-on the command line, and may be tagged using the '-n' option.
-
-MD5 checksums are used at runtime for precise detection of reversions.
+In addition to parsing WikiMedia XML data dumps into a tab-separated
+tabular format, wikiq can match Perl-compatible regular expressions
+against revision content, can extract article diffs, and can match
+regexes against the additions and deletions between revisions. Any
+number of regular expressions may be supplied on the command line, and
+may be tagged using the '-n' and -N options.
 
+MD5 checksums of revisions are used at runtime.
 
 
 output:
 
 wikiq generates these fields for each revision:
 
 title, articleid, revid, timestamp, anon, editor, editorid, minor,
-text_length, text_entropy, text_md5, reversion, additions_size, deletions_size
-.... and additional fields for each regex executed against add/delete diffs
+text_length, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against content or
+added/deleted diffs
 
 Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
 unless the article is a revert to a previous revision, in which case, it
 contains the revision ID of the revision which was reverted to.
-
-author: Erik Garrison
diff --git a/wikiq.cpp b/wikiq.cpp
index 6a20481..1dce957 100644
--- a/wikiq.cpp
+++ b/wikiq.cpp
@@ -545,7 +545,8 @@ void print_usage(char* argv[]) {
        << "unless the article is a revert to a previous revision, in which case, it" << endl
        << "contains the revision ID of the revision which was reverted to." << endl
        << endl
-       << "author: Erik Garrison " << endl;
+       << "authors: Erik Garrison " << endl
+       << " Benjamin Mako Hill " << endl;
 }
 
 
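
Note on the reversion field: the README text in this patch says the field is blank unless a revision reverts to an earlier one, in which case it holds the ID of the revision that was reverted to, and that MD5 checksums of revisions are used at runtime. The following is a minimal sketch of how digest-based revert detection of that kind could work. It is not taken from wikiq.cpp. `ReversionTracker` and `observe` are hypothetical names, `std::hash` stands in for MD5 to keep the example dependency-free, and reporting the earliest matching revision is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of wikiq.cpp: remembers a digest of every
// revision text seen so far for one article.
class ReversionTracker {
public:
    // Returns the ID of the earliest revision with identical text, or an
    // empty string when this text has not been seen before (mirroring the
    // "blank unless the article is a revert" rule in the README).
    std::string observe(const std::string& revid, const std::string& text) {
        // Assumption: std::hash replaces MD5 here. It is not a cryptographic
        // digest, so collisions are more likely than with a real checksum.
        const std::size_t digest = std::hash<std::string>()(text);

        // emplace() leaves an existing entry untouched, so the map always
        // keeps the first revision ID recorded for each digest.
        auto result = first_seen_.emplace(digest, revid);
        if (result.second) {
            return std::string();      // new content: not a revert
        }
        return result.first->second;   // revert: ID of the earlier revision
    }

private:
    std::unordered_map<std::size_t, std::string> first_seen_;
};
```

One tracker would be created per article and fed revisions in dump order. A revision whose text matches an earlier one yields that earlier revision's ID, and every other revision yields an empty string.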