Added support for regex matching of titles, for example to match only
pages from particular namespaces. In the process, I renamed the "t"
option, which prints out text, to "d" for debug, and reused "t" for
taking title regexes.
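A minimal sketch of title filtering with POSIX regex.h follows. The
function names are hypothetical and the use of POSIX extended syntax is an
assumption; this commit message does not say which regex library or syntax
the tool actually uses.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile a title pattern as a POSIX extended
 * regex. REG_NOSUB is set because only a yes/no answer is needed.
 * Returns 0 on success, a nonzero regcomp error code otherwise. */
int compile_title_regex(regex_t *re, const char *pattern)
{
    return regcomp(re, pattern, REG_EXTENDED | REG_NOSUB);
}

/* Hypothetical helper: returns 1 if the page title matches the
 * compiled pattern, 0 otherwise. */
int title_matches(const regex_t *re, const char *title)
{
    return regexec(re, title, 0, NULL, 0) == 0;
}
```

With a pattern such as "^(Talk|User):", only titles in those two
namespaces would pass the filter. The caller releases the compiled pattern
with regfree() when done.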
Regex search was only being called on diffs. Of course, if something is
added in the very first revision, it will never show up in a diff. This
resulted in deletions with no corresponding additions. I've patched
the code so that the first revision is treated as one big addition.
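The first-revision rule can be sketched roughly as below. All names here
are hypothetical, the diff routine itself is not shown, and representing
"no previous revision" as a NULL pointer is an assumption made for the
example rather than a description of the patched code.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical: choose the text to scan as "added" content.
 * prev_text == NULL marks the first revision of a page.
 * diff_added is whatever a diff routine (not shown) reported as added. */
const char *added_text_for(const char *prev_text,
                           const char *cur_text,
                           const char *diff_added)
{
    if (prev_text == NULL)
        return cur_text;   /* first revision: everything is new */
    return diff_added;     /* later revisions: only the diffed-in text */
}

/* Hypothetical: returns 1 if the added content of a revision matches. */
int addition_matches(const regex_t *re,
                     const char *prev_text,
                     const char *cur_text,
                     const char *diff_added)
{
    const char *added = added_text_for(prev_text, cur_text, diff_added);
    return regexec(re, added, 0, NULL, 0) == 0;
}
```

The point of the sketch is only that text present in the first revision is
now scanned at all, so a later deletion of it has a matching addition.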
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text
Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended. Because strlen rescans the whole field
on every append, the total work is O(N^2) in the field length! The problem
was most severe for vandalized articles with inlined HTML, but it
ultimately affected all data parsing, as expat calls the charhndl function
every time it resolves a predefined XML entity.
By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead. Now, we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump. The current Simple English wiki dump takes about 3
minutes to process on my 2x2.4GHz laptop, even when handling all text
data. Just decompressing it to /dev/null takes around 1 minute.
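A minimal sketch of a length-tracked append follows. The struct and
function names are hypothetical stand-ins, not the actual revisionData
layout, and the doubling growth policy is an assumption added to keep the
example self-contained.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one text field of revisionData. */
struct text_field {
    char  *data;  /* NUL-terminated buffer, or NULL while empty */
    size_t len;   /* bytes used, excluding the terminating NUL */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes of s to the field. s need not be NUL-terminated.
 * The stored length gives the write position directly, so no strlen
 * over the existing text is needed.
 * Returns 0 on success, -1 if allocation fails. */
int field_append(struct text_field *f, const char *s, size_t n)
{
    if (f->len + n + 1 > f->cap) {
        /* Assumed policy: double the capacity until the data fits. */
        size_t cap = f->cap ? f->cap : 64;
        while (cap < f->len + n + 1)
            cap *= 2;
        char *p = realloc(f->data, cap);
        if (p == NULL)
            return -1;
        f->data = p;
        f->cap = cap;
    }
    memcpy(f->data + f->len, s, n);  /* copy only the new bytes */
    f->len += n;
    f->data[f->len] = '\0';          /* keep the field a valid C string */
    return 0;
}
```

Each append then costs time proportional to the new chunk rather than to
the text already stored. Taking an explicit byte count also suits expat,
whose character data handler receives a pointer and a length rather than a
NUL-terminated string.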