rename variables so regex/regexes is diff_regex/regex
All the code that checks for patterns within diffs uses variables named
regex, regexes, etc. This is problematic because (a) we already have one
more place where we can check for a regex and (b) I'm about to add code
that looks for regexes in arbitrary revisions (/not/ diffs). This change
"makes room" for the next one.
The variables used to be called wp_namespace_*, which was confusing
because the software is neither limited to Wikipedia nor only useful for
matching namespaces.
Working with Bernie Innocenti, I completely overhauled the Makefile so
that it can build static versions of the library, which I need in order
to run it on my cluster. In general, this has simplified the Makefile. I
had to make one small change to the disorder file in the process.
Added support for regex matching of titles in order to (for example)
match only pages from particular namespaces. In the process, I changed
the "t" option, which used to print out text, to "d" (for debug), and
reused the "t" option for taking title regexes.
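As a rough illustration of the idea (not the tool's actual code), title
filtering with POSIX regexes could look like the sketch below; the
function and variable names are placeholders:

    #include <regex.h>
    #include <stdio.h>
    #include <stdlib.h>

    static regex_t title_regex;          /* compiled from the "t" argument */
    static int have_title_regex = 0;

    /* compile the pattern given on the command line */
    static void set_title_regex(const char *pattern) {
        if (regcomp(&title_regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "bad title regex: %s\n", pattern);
            exit(1);
        }
        have_title_regex = 1;
    }

    /* return 1 if the page title passes the filter */
    static int title_matches(const char *title) {
        if (!have_title_regex)
            return 1;                    /* no filter: accept every page */
        return regexec(&title_regex, title, 0, NULL, 0) == 0;
    }

A pattern such as "^Talk:" would, for instance, restrict processing to
pages in the Talk namespace.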
Regex search was only being run on diffs. Of course, if something is
added in the very first revision, it will never show up in a diff. This
was resulting in deletions that never showed up as additions. I've
patched the code so that the first revision is treated as one big
addition.
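A minimal sketch of that fix, assuming illustrative field and function
names (the project's real revisionData structure and matching routines
differ):

    /* stand-ins for the tool's existing matching/diffing routines */
    void match_diff_regexes(const char *added_text);
    void diff_and_match(const char *old_text, const char *new_text);

    typedef struct {
        char *prev_text;  /* text of the previous revision; NULL for the first */
        char *text;       /* text of the current revision */
    } revisionData;

    void search_revision(revisionData *rev) {
        if (rev->prev_text == NULL) {
            /* first revision of the page: the whole text is one addition */
            match_diff_regexes(rev->text);
        } else {
            /* later revisions: diff against the previous text as before */
            diff_and_match(rev->prev_text, rev->text);
        }
    }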
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text
Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended. This is O(N^2)! The problem was
most severe for vandalized articles with inlined HTML, but it ultimately
affected all data parsing as expat calls the charhndl function every
time it resolves a default XML entity.
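Roughly, the old pattern looked like the sketch below (struct and field
names are illustrative): the strlen rescans the whole accumulated text
on every call, which is where the quadratic cost comes from.

    #include <expat.h>
    #include <string.h>

    typedef struct {
        char *text;   /* accumulated revision text (assume it is large enough) */
    } revisionData;

    /* expat character-data handler: rescans the whole buffer on each call */
    static void XMLCALL charhndl(void *userData, const XML_Char *s, int len) {
        revisionData *data = (revisionData *) userData;
        size_t old = strlen(data->text);        /* O(current length) every time */
        memcpy(data->text + old, s, (size_t) len);
        data->text[old + (size_t) len] = '\0';  /* bounds checking omitted */
    }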
By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead. Now we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump. The current Simple English wiki dump takes about 3
minutes to process on my 2x2.4 GHz laptop, even when handling all text
data. Just decompressing it to /dev/null takes around 1 minute.
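A minimal sketch of the length-tracking approach, assuming a fixed-size
buffer and illustrative field names rather than the project's exact
ones:

    #include <expat.h>
    #include <string.h>

    #define TEXT_BUF_LEN (10 * 1024 * 1024)  /* assumed buffer size */

    typedef struct {
        char *text;       /* revision text buffer of TEXT_BUF_LEN bytes */
        size_t text_len;  /* tracked length: no strlen() needed */
    } revisionData;

    /* strncat-style append that uses and updates the tracked length */
    static void append_text(revisionData *data, const XML_Char *s, size_t len) {
        if (data->text_len + len >= TEXT_BUF_LEN)  /* truncate, don't overflow */
            len = TEXT_BUF_LEN - 1 - data->text_len;
        memcpy(data->text + data->text_len, s, len);
        data->text_len += len;
        data->text[data->text_len] = '\0';
    }

    static void XMLCALL charhndl(void *userData, const XML_Char *s, int len) {
        append_text((revisionData *) userData, s, (size_t) len);
    }

Each append now costs only the length of the new chunk, so total parsing
time is linear in the amount of text.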