X-Git-Url: https://projects.mako.cc/source/wikiq/blobdiff_plain/e861323a25caf19a4dbda73b69f6400b5f183da3..b46991ccfb280f0d64fe3d15fb1862f49f08beaf:/README

diff --git a/README b/README
index f4bae6a..e5b019c 100644
--- a/README
+++ b/README
@@ -1,18 +1,18 @@
-wikiq: a WikiMedia XML data dump to .tsv parser
-
-author: Erik Garrison
+wikiq: a simple and fast stream-based MediaWiki XML dump parser
 
+authors: Erik Garrison
+         Benjamin Mako Hill
 
 overview:
 
-wikiq is written in C using expat. It is designed to enable researchers to
-rapidly extract revision histories (minus text and comments) from impossibly
-large XML datasets.
-
+wikiq is written in C++ using expat. It is designed to enable
+researchers to rapidly extract revision histories (minus text and
+comments) from large XML datasets.
 
 use:
 
-To use, first make sure you have libexpat and libpcrecpp installed, then:
+To use, first make sure you have libexpat and libpcrecpp installed (e.g.
+via packages libexpat1 and libpcre3-dev on Ubuntu), then:
 
    % make
    % ./wikiq -h # prints usage
@@ -21,26 +21,25 @@ To use, first make sure you have libexpat and libpcrecpp installed, then:
 
 features:
 
-In addition to parsing WikiMedia XML data dumps into a tab-separated tabular
-format, wikiq extracts article diffs and can execute arbitrary Perl-compatible
-regular expressions against the additions and deletions which differentiate any
-revision from the previous. Any number of regular expressions may be supplied
-on the command line, and may be tagged using the '-n' option.
-
-MD5 checksums are used at runtime for precise detection of reversions.
+In addition to parsing WikiMedia XML data dumps into a tab-separated
+tabular format, wikiq can match Perl-compatible regular expressions
+against revision content, can extract article diffs, and can match
+regexes against the additions and deletions between revisions. Any
+number of regular expressions may be supplied on the command line, and
+may be tagged using the '-n' and -N options.
 
+MD5 checksums of revisions are used at runtime.
 
 
 output:
 
 wikiq generates these fields for each revision:
 title, articleid, revid, timestamp, anon, editor, editorid, minor,
-text_length, text_entropy, text_md5, reversion, additions_size, deletions_size
-.... and additional fields for each regex executed against add/delete diffs
+text_length, text_md5, reversion, additions_size, deletions_size
+.... and additional fields for each regex executed against content or
+added/deleted diffs
 
 
 Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
 unless the article is a revert to a previous revision, in which case, it
 contains the revision ID of the revision which was reverted to.
-
-author: Erik Garrison
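
The README text in this patch says that MD5 checksums of revisions are used at runtime and that the reversion field holds the ID of the revision that was reverted to. The following is a minimal Python sketch of how a checksum-based check of that kind could work. It is not wikiq's C++ implementation: the function name detect_reversions, the sample history, and the choice to report the earliest matching revision are all assumptions made for illustration.

```python
import hashlib


def detect_reversions(revisions):
    """Yield (revid, md5_hex, reverted_to) for (revid, text) pairs, oldest first.

    Hypothetical helper, not part of wikiq. reverted_to is the ID of the
    earliest earlier revision with byte-identical text, or "" if the text
    has not been seen before (mirroring the blank 'reversion' field).
    """
    first_seen = {}  # md5 hex digest -> revid of the first revision with that text
    for revid, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        reverted_to = first_seen.get(digest, "")
        # Remember only the first revision that produced this digest.
        first_seen.setdefault(digest, revid)
        yield revid, digest, reverted_to


# Made-up revision history for one article (assumption, not real dump data).
history = [
    (101, "stub"),
    (102, "stub plus vandalism"),
    (103, "stub"),
    (104, "stub, expanded"),
]

rows = list(detect_reversions(history))
for revid, digest, reverted_to in rows:
    # Tab-separated, loosely following the revid / text_md5 / reversion fields.
    print(f"{revid}\t{digest}\t{reverted_to}")
```

In this sample, revision 103 restores the text of revision 101, so its third column holds 101 and the column is blank for the other three revisions. Comparing fixed-size digests instead of full texts keeps memory use small while streaming a large dump, which is the likely reason a checksum is stored per revision.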