Benjamin Mako Hill [Mon, 2 May 2011 19:56:48 +0000 (15:56 -0400)]
added support for regex title matching

Added support for regex matching of titles, for example to match only
pages from particular namespaces. In the process, I renamed the "t"
option, which prints out text, to "d" (for debug) and reused "t" for
taking title regexes.
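
A minimal sketch of the kind of title filter described above. It uses
std::regex purely for illustration; the regex library the tool itself
uses is not stated here, and title_matches is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when the page title matches the
// user-supplied pattern, e.g. "^Talk:" to keep only one namespace.
bool title_matches(const std::string& title, const std::regex& pattern) {
    // regex_search accepts a match anywhere in the title, so anchors
    // such as ^ and $ are left to the caller.
    return std::regex_search(title, pattern);
}
```

With the pattern "^Talk:", the title "Talk:Main Page" is kept and
"Main Page" is skipped.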

Benjamin Mako Hill [Mon, 2 May 2011 18:01:09 +0000 (14:01 -0400)]
lower case N in the usage instructions

Benjamin Mako Hill [Mon, 2 May 2011 17:59:02 +0000 (13:59 -0400)]
fix bug and invoke regex search in first revision

Regex search was only being called on diffs. Of course, if something is
added in the very first revision, it never shows up in a diff. This
resulted in deletions with no corresponding additions. I've patched
the code so that the first revision is treated as one big addition.
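
The fix can be sketched as follows, assuming a naive line-based diff
and std::regex as stand-ins for the real diff and regex routines. All
names are hypothetical. Passing an empty string as the previous text
models the first revision, so every line counts as an addition.

```cpp
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Splits text into lines (without the trailing newline characters).
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Naive stand-in for a diff: lines of `current` absent from `previous`.
std::vector<std::string> added_lines(const std::string& previous,
                                     const std::string& current) {
    const std::vector<std::string> prev = split_lines(previous);
    const std::set<std::string> old_lines(prev.begin(), prev.end());
    std::vector<std::string> added;
    for (const std::string& line : split_lines(current)) {
        if (old_lines.count(line) == 0) added.push_back(line);
    }
    return added;
}

// True when the pattern occurs in text added by this revision.
// For the first revision, pass "" as `previous`.
bool regex_in_additions(const std::string& previous,
                        const std::string& current,
                        const std::regex& pattern) {
    for (const std::string& line : added_lines(previous, current)) {
        if (std::regex_search(line, pattern)) return true;
    }
    return false;
}
```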

Erik Garrison [Tue, 15 Mar 2011 03:03:04 +0000 (23:03 -0400)]
dynamically resize article text buffer if we would overflow

Erik Garrison [Mon, 14 Mar 2011 00:49:54 +0000 (20:49 -0400)]
updated README

Erik Garrison [Mon, 14 Mar 2011 00:37:24 +0000 (20:37 -0400)]
help text updates

Erik Garrison [Mon, 14 Mar 2011 00:30:43 +0000 (20:30 -0400)]
regex against article diffs and checksum-based revert detection
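
One way checksum-based revert detection can work, as a rough sketch:
remember the checksum of every revision of a page and flag a revision
whose checksum has been seen before. std::hash stands in for a real
checksum such as MD5 here, and RevertDetector is a hypothetical name,
not the tool's own structure.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

class RevertDetector {
public:
    // Returns the index of the earliest revision with the same checksum,
    // or -1 if this text has not been seen before on this page.
    long observe(const std::string& text) {
        // Stand-in checksum; a real tool would use a stronger digest
        // to make accidental collisions negligible.
        const std::size_t sum = std::hash<std::string>{}(text);
        const long index = next_index_++;
        const auto it = seen_.find(sum);
        if (it != seen_.end()) return it->second;
        seen_[sum] = index;
        return -1;
    }

private:
    std::unordered_map<std::size_t, long> seen_;
    long next_index_ = 0;
};
```

Feeding the revisions "A", "B", "A" in order reports the third one as
a revert to revision 0.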

Erik Garrison [Sun, 13 Mar 2011 18:17:39 +0000 (14:17 -0400)]
added (broken, but running) diff routines for block-level diffs

Erik Garrison [Fri, 28 Jan 2011 12:19:27 +0000 (07:19 -0500)]
add md5sum, entropy

Erik Garrison [Sun, 24 Oct 2010 20:28:15 +0000 (16:28 -0400)]
added shannon_H entropy metric for each revision
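
For reference, a small sketch of a Shannon entropy calculation over
the bytes of a revision's text, in bits per byte. Whether the tool
computes it per byte or per character is an assumption, and
shannon_entropy is a hypothetical name.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// H = -sum(p_i * log2(p_i)) over the byte values that occur in text.
double shannon_entropy(const std::string& text) {
    if (text.empty()) return 0.0;
    std::array<std::size_t, 256> counts{};  // zero-initialized histogram
    for (unsigned char c : text) ++counts[c];
    const double n = static_cast<double>(text.size());
    double h = 0.0;
    for (std::size_t count : counts) {
        if (count == 0) continue;  // absent bytes contribute nothing
        const double p = static_cast<double>(count) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

A text made of a single repeated byte has entropy 0, and a text with
two equally frequent bytes has entropy 1 bit per byte.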

Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text

Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended.  Each strlen walks the whole field, so
appending N chunks costs O(N^2) overall!  The problem was most severe for
vandalized articles with inlined HTML, but it ultimately affected all
data parsing, as expat calls the charhndl function every time it resolves
a default XML entity.

By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead.  Now, we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump.  The current Simple English Wikipedia dump takes about 3
minutes to process on my 2x2.4 GHz laptop, even when handling all text
data.  Just decompressing it to /dev/null takes around 1 minute.

Erik Garrison [Sat, 23 Oct 2010 22:57:50 +0000 (18:57 -0400)]
buffer storage of text data, structure cleanup

Using a 2MB fixed-size buffer for the article text data speeds up
processing dramatically when text data is not thrown away.

Erik Garrison [Mon, 11 Oct 2010 17:53:07 +0000 (13:53 -0400)]
properly implemented text and comment omission flag

Erik Garrison [Thu, 7 Oct 2010 12:46:50 +0000 (08:46 -0400)]
moved from gcc to g++

Erik Garrison [Thu, 7 Oct 2010 04:55:57 +0000 (00:55 -0400)]
Revert "changed compilation flags to -O3"

This reverts commit 315071f4419c2e34fd367e09d24818807353c91e.

Erik Garrison [Thu, 7 Oct 2010 04:52:10 +0000 (00:52 -0400)]
changed compilation flags to -O3

Erik Garrison [Thu, 7 Oct 2010 04:50:04 +0000 (00:50 -0400)]
parseData -> revisionData

Erik Garrison [Thu, 7 Oct 2010 04:44:59 +0000 (00:44 -0400)]
first produced in the summer of 2006, finally published!
