X-Git-Url: https://projects.mako.cc/source/wikiq/blobdiff_plain/fd20cb1bc83a8062f951fc43ef7dd16d4da1ab04..32e1edb3eb790387857b5ad6dd2de18187f4367b:/README?ds=inline

diff --git a/README b/README
index f4bae6a..4478dca 100644
--- a/README
+++ b/README
@@ -1,14 +1,13 @@
-wikiq: a WikiMedia XML data dump to .tsv parser
-
-author: Erik Garrison
+wikiq: a simple and fast stream-based MediaWiki XML dump parser
+authors: Erik Garrison
+         Benjamin Mako Hill
 overview:
-wikiq is written in C using expat. It is designed to enable researchers to
-rapidly extract revision histories (minus text and comments) from impossibly
-large XML datasets.
-
+wikiq is written in C++ using expat. It is designed to enable
+researchers to rapidly extract revision histories (minus text and
+comments) from large XML datasets.
 use:
@@ -21,26 +20,25 @@ To use, first make sure you have libexpat and libpcrecpp installed, then:
 features:
-In addition to parsing WikiMedia XML data dumps into a tab-separated tabular
-format, wikiq extracts article diffs and can execute arbitrary Perl-compatible
-regular expressions against the additions and deletions which differentiate any
-revision from the previous. Any number of regular expressions may be supplied
-on the command line, and may be tagged using the '-n' option.
-
-MD5 checksums are used at runtime for precise detection of reversions.
+In addition to parsing WikiMedia XML data dumps into a tab-separated
+tabular format, wikiq can match Perl-compatible regular expressions
+against revision content, can extract article diffs, and can match
+regexes against the additions and deletions between revisions. Any
+number of regular expressions may be supplied on the command line, and
+may be tagged using the '-n' and -N options.
+MD5 checksums of revisions are used at runtime.
 output:
 wikiq generates these fields for each revision:
 title, articleid, revid, timestamp, anon, editor, editorid, minor,
-text_length, text_entropy, text_md5, reversion, additions_size, deletions_size
+text_length, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against content or
+added/deleted diffs
 Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
 unless the article is a revert to a previous revision, in which case, it
 contains the revision ID of the revision which was reverted to.
-
-author: Erik Garrison
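
note on the reversion field:

The README text in this diff says that MD5 checksums of revisions are used
at runtime and that the reversion field holds the ID of the revision that
was reverted to. The sketch below shows one way those two statements could
fit together. It is not wikiq's code (wikiq is written in C++); it is a
minimal Python illustration. The function name detect_reversions and the
(revid, text) input shape are hypothetical, and the choice to report the
earliest revision with a matching checksum is an assumption; wikiq itself
may pick a different one.

```python
import hashlib


def detect_reversions(revisions):
    """Return (revid, text_md5, reversion) rows for one article.

    `revisions` is an iterable of (revid, text) pairs in chronological
    order. This input shape is hypothetical, chosen for illustration.
    """
    first_seen = {}  # MD5 hex digest -> earliest revid with that text
    rows = []
    for revid, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        reverted_to = first_seen.get(digest)
        # The reversion column stays blank unless this text was seen before.
        # Assumption: the earliest matching revision is the one reported.
        reversion = "" if reverted_to is None else str(reverted_to)
        rows.append((revid, digest, reversion))
        # Remember only the first revision that produced this checksum.
        first_seen.setdefault(digest, revid)
    return rows


if __name__ == "__main__":
    history = [(101, "stable text"), (102, "vandalized"), (103, "stable text")]
    for row in detect_reversions(history):
        print("\t".join(str(field) for field in row))
```

Comparing checksums instead of full revision text keeps the lookup table
small, which matters when streaming through a large dump. Note that under
this sketch a revision identical to its immediate predecessor is also
reported as a revert.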