Benjamin Mako Hill [Fri, 11 Apr 2014 22:12:43 +0000 (15:12 -0700)]
rename variables so regex/regexes is diff_regex/regex
All the code to check for patterns within diff are named regex,
regexes, etc. This is problematic because (a) we already have one more
place we can check for regex and (b) I'm about to add code to look
for regex in arbitrary revisions (/not/ diffs). This change "makes room"
for the next one.
Benjamin Mako Hill [Fri, 11 Apr 2014 22:05:36 +0000 (15:05 -0700)]
renamed variables for title regex
The variables used to be called wp_namespace_* which was confusing
because the software is not only useful for Wikipedia and is not only
useful for matching namespaces.
Benjamin Mako Hill [Fri, 11 Apr 2014 21:58:40 +0000 (14:58 -0700)]
ignore object files created in compilation
Benjamin Mako Hill [Mon, 2 May 2011 23:55:37 +0000 (19:55 -0400)]
made changes necessary to build a static version
Working with Bernie Innocenti, I completely overhauled the Makefile with
the goal of having it be able to build static versions of the library
which I need to run it on my cluster. In general, this has simplified
the Makefile. I had to make one small change to the disorder file in the
process.
Benjamin Mako Hill [Mon, 2 May 2011 19:56:48 +0000 (15:56 -0400)]
added support for regex title matching
Added support for regex matching of titles to (for example) match only
things from particular namespaces. In the process, I changed the "t"
option which prints out text to "d" for debug and used the t option for
taking title regexes.
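A minimal sketch of title matching with a regex, assuming POSIX regex.h and extended syntax. The function name title_matches and the example patterns are hypothetical; the log does not say which regex library or syntax the tool uses.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the repository.
 * Returns 1 if title matches pattern, 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
static int title_matches(const char *pattern, const char *title)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && title != NULL);

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, title, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With a pattern such as "^Talk:", only pages whose titles begin with that namespace prefix would be kept. A real tool would compile each pattern once at startup rather than on every title.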
Benjamin Mako Hill [Mon, 2 May 2011 18:01:09 +0000 (14:01 -0400)]
lower case N in the usage instructions
Benjamin Mako Hill [Mon, 2 May 2011 17:59:02 +0000 (13:59 -0400)]
fix bug and invoke regex search in first revision
Regex search was only being called in diffs. Of course, if something is
added in the very first revision, it will never show up in a diff. This
was resulting in deletions that don't show up as additions. I've patched
the code so that the first revision is treated as one big addition.
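The fix described above can be sketched roughly as follows. The helper added_text and its parameters are hypothetical and only illustrate the idea: when a revision has no predecessor, the whole text is searched as if it had been added.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the repository.
 * Chooses the text that should be searched for "added" matches.
 *   prev_text  : previous revision's text, or NULL for the first revision
 *   cur_text   : current revision's text
 *   diff_added : text the diff reports as added, or NULL if nothing was added
 */
static const char *added_text(const char *prev_text,
                              const char *cur_text,
                              const char *diff_added)
{
    assert(cur_text != NULL);

    /* No previous revision: treat the whole text as one big addition. */
    if (prev_text == NULL)
        return cur_text;

    return diff_added != NULL ? diff_added : "";
}
```

Without the first branch, a pattern present from the very first revision would never be reported as added, yet its later removal would still be reported as a deletion.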
Erik Garrison [Tue, 15 Mar 2011 03:03:04 +0000 (23:03 -0400)]
dynamically resize article text buffer if we would overflow
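A minimal sketch of growing a text buffer on demand, assuming a doubling strategy. The text_buffer type, the buffer_append name, and the initial capacity of 16 are assumptions for illustration; the log does not describe the actual growth policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type, not taken from the repository. */
typedef struct {
    char  *data;  /* NUL-terminated text, or NULL when empty */
    size_t len;   /* bytes used, excluding the terminating NUL */
    size_t cap;   /* bytes allocated */
} text_buffer;

/* Appends n bytes of src, growing the buffer if it would overflow.
 * Returns 0 on success and -1 if memory could not be allocated. */
static int buffer_append(text_buffer *b, const char *src, size_t n)
{
    size_t need;

    assert(b != NULL && src != NULL);
    need = b->len + n + 1;            /* +1 for the terminating NUL */

    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 16;
        char *p;

        while (cap < need)
            cap *= 2;                 /* doubling keeps appends amortized O(1) */
        p = realloc(b->data, cap);
        if (p == NULL)
            return -1;                /* the old buffer is still valid */
        b->data = p;
        b->cap = cap;
    }

    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}
```

Assigning the realloc result to a temporary first avoids losing the original buffer when allocation fails.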
Erik Garrison [Mon, 14 Mar 2011 00:49:54 +0000 (20:49 -0400)]
updated README
Erik Garrison [Mon, 14 Mar 2011 00:37:24 +0000 (20:37 -0400)]
help text updates
Erik Garrison [Mon, 14 Mar 2011 00:30:43 +0000 (20:30 -0400)]
regex against article diffs and checksum-based revert detection
Erik Garrison [Sun, 13 Mar 2011 18:17:39 +0000 (14:17 -0400)]
added (broken, but running) diff routines for block-level diffs
Erik Garrison [Fri, 28 Jan 2011 12:19:27 +0000 (07:19 -0500)]
add md5sum, entropy
Erik Garrison [Sun, 24 Oct 2010 20:28:15 +0000 (16:28 -0400)]
added shannon_H entropy metric for each revision
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text
Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended. This is O(N^2)! The problem was
most severe for vandalized articles with inlined HTML, but it ultimately
affected all data parsing as expat calls the charhndl function every
time it resolves a default XML entity.
By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead. Now, we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump. The current Simple English Wikipedia dump takes about 3
minutes to process on my 2x2.4GHz laptop, even when handling all text
data. Just decompressing it to /dev/null takes around 1 minute.
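The length-tracking idea can be sketched as below. The name append_tracked and its signature are hypothetical, not the function from the repository; the point is only that writing at dst + *dst_len avoids rescanning the destination with strlen on every character-data callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical strncat-style helper, not taken from the repository.
 * Appends up to n bytes of src to dst, whose current length is stored
 * in *dst_len and whose total capacity is dst_cap bytes.
 * Truncates if the data does not fit. Returns the number of bytes copied.
 */
static size_t append_tracked(char *dst, size_t *dst_len, size_t dst_cap,
                             const char *src, size_t n)
{
    size_t room;

    assert(dst != NULL && dst_len != NULL && src != NULL);
    assert(*dst_len < dst_cap);

    room = dst_cap - *dst_len - 1;    /* keep one byte for the NUL */
    if (n > room)
        n = room;

    /* Cost is O(n) in the appended data, independent of *dst_len. */
    memcpy(dst + *dst_len, src, n);
    *dst_len += n;
    dst[*dst_len] = '\0';
    return n;
}
```

With plain strcat or strncat, each of k appends to a field of final length N walks the existing text first, which is where the quadratic cost comes from. Tracking the length makes the total work proportional to N.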
Erik Garrison [Sat, 23 Oct 2010 22:57:50 +0000 (18:57 -0400)]
buffer storage of text data, structure cleanup
Using a 2MB fixed-size buffer for the article text data speeds up
processing dramatically when text data is not thrown away.
Erik Garrison [Mon, 11 Oct 2010 17:53:07 +0000 (13:53 -0400)]
properly implemented text and comment omission flag
Erik Garrison [Thu, 7 Oct 2010 12:46:50 +0000 (08:46 -0400)]
moved from gcc to g++
Erik Garrison [Thu, 7 Oct 2010 04:55:57 +0000 (00:55 -0400)]
Revert "changed compilation flags to -O3"
This reverts commit
315071f4419c2e34fd367e09d24818807353c91e.
Erik Garrison [Thu, 7 Oct 2010 04:52:10 +0000 (00:52 -0400)]
changed compilation flags to -O3
Erik Garrison [Thu, 7 Oct 2010 04:50:04 +0000 (00:50 -0400)]
parseData -> revisionData
Erik Garrison [Thu, 7 Oct 2010 04:44:59 +0000 (00:44 -0400)]
first produced in the summer of 2006, finally published!