rename variables so regex/regexes is diff_regex/regex
All the code that checks for patterns within diffs uses variables named
regex, regexes, etc. This is problematic because (a) we already have one
more place where we can check for a regex and (b) I'm about to add code
that looks for regexes in arbitrary revisions (/not/ diffs). This change
"makes room" for the next one.
The variables used to be called wp_namespace_*, which was confusing
because the software is neither limited to Wikipedia nor only useful for
matching namespaces.
Working with Bernie Innocenti, I completely overhauled the Makefile so
that it can build static versions of the library, which I need in order
to run it on my cluster. In general, this has simplified the Makefile. I
had to make one small change to the disorder file in the process.
Added support for regex matching of titles in order to (for example)
match only pages from particular namespaces. In the process, I changed
the "t" option, which used to print out text, to "d" (for debug), and
reused the "t" option for taking title regexes.
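As a rough illustration of the idea (not the tool's actual code), title
filtering with POSIX regexes could look like the sketch below; the
function and variable names are placeholders:

    #include <regex.h>
    #include <stdio.h>
    #include <stdlib.h>

    static regex_t title_regex;          /* compiled from the "t" argument */
    static int have_title_regex = 0;

    /* compile the pattern given on the command line */
    static void set_title_regex(const char *pattern) {
        if (regcomp(&title_regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "bad title regex: %s\n", pattern);
            exit(1);
        }
        have_title_regex = 1;
    }

    /* return 1 if the page title passes the filter */
    static int title_matches(const char *title) {
        if (!have_title_regex)
            return 1;                    /* no filter: accept every page */
        return regexec(&title_regex, title, 0, NULL, 0) == 0;
    }

A pattern such as "^Talk:" would, for instance, restrict processing to
pages in the Talk namespace.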
Regex search was only being run on diffs. Of course, if something is
added in the very first revision, it will never show up in a diff. This
was resulting in deletions that never showed up as additions. I've
patched the code so that the first revision is treated as one big
addition.
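A minimal sketch of that fix, assuming illustrative field and function
names (the project's real revisionData structure and matching routines
differ):

    /* stand-ins for the tool's existing matching/diffing routines */
    void match_diff_regexes(const char *added_text);
    void diff_and_match(const char *old_text, const char *new_text);

    typedef struct {
        char *prev_text;  /* text of the previous revision; NULL for the first */
        char *text;       /* text of the current revision */
    } revisionData;

    void search_revision(revisionData *rev) {
        if (rev->prev_text == NULL) {
            /* first revision of the page: the whole text is one addition */
            match_diff_regexes(rev->text);
        } else {
            /* later revisions: diff against the previous text as before */
            diff_and_match(rev->prev_text, rev->text);
        }
    }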
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text
Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended. This is O(N^2)! The problem was
most severe for vandalized articles with inlined HTML, but it ultimately
affected all data parsing as expat calls the charhndl function every
time it resolves a default XML entity.
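Roughly, the old pattern looked like the sketch below (struct and field
names are illustrative): the strlen rescans the whole accumulated text
on every call, which is where the quadratic cost comes from.

    #include <expat.h>
    #include <string.h>

    typedef struct {
        char *text;   /* accumulated revision text (assume it is large enough) */
    } revisionData;

    /* expat character-data handler: rescans the whole buffer on each call */
    static void XMLCALL charhndl(void *userData, const XML_Char *s, int len) {
        revisionData *data = (revisionData *) userData;
        size_t old = strlen(data->text);        /* O(current length) every time */
        memcpy(data->text + old, s, (size_t) len);
        data->text[old + (size_t) len] = '\0';  /* bounds checking omitted */
    }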
By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead. Now we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump. The current Simple English wiki dump takes about 3
minutes to process on my 2x2.4 GHz laptop, even when handling all text
data. Just decompressing it to /dev/null takes around 1 minute.
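A minimal sketch of the length-tracking approach, assuming a fixed-size
buffer and illustrative field names rather than the project's exact
ones:

    #include <expat.h>
    #include <string.h>

    #define TEXT_BUF_LEN (10 * 1024 * 1024)  /* assumed buffer size */

    typedef struct {
        char *text;       /* revision text buffer of TEXT_BUF_LEN bytes */
        size_t text_len;  /* tracked length: no strlen() needed */
    } revisionData;

    /* strncat-style append that uses and updates the tracked length */
    static void append_text(revisionData *data, const XML_Char *s, size_t len) {
        if (data->text_len + len >= TEXT_BUF_LEN)  /* truncate, don't overflow */
            len = TEXT_BUF_LEN - 1 - data->text_len;
        memcpy(data->text + data->text_len, s, len);
        data->text_len += len;
        data->text[data->text_len] = '\0';
    }

    static void XMLCALL charhndl(void *userData, const XML_Char *s, int len) {
        append_text((revisionData *) userData, s, (size_t) len);
    }

Each append now costs only the length of the new chunk, so total parsing
time is linear in the amount of text.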