8 years ago Merge branch 'patch-3' of master
Benjamin Mako Hill [Fri, 18 Jul 2014 21:40:28 +0000 (14:40 -0700)]
Merge branch 'patch-3' of

8 years ago Merge branch 'patch-2' of
Benjamin Mako Hill [Fri, 18 Jul 2014 21:39:06 +0000 (14:39 -0700)]
Merge branch 'patch-2' of

8 years ago updated patch that updates README from nemobis
Benjamin Mako Hill [Fri, 18 Jul 2014 21:35:35 +0000 (14:35 -0700)]
updated patch that updates README from nemobis

- specified that the Ubuntu packages are also Debian packages

8 years ago Merge remote-tracking branch 'gh-nemobis/patch-1'
Benjamin Mako Hill [Fri, 18 Jul 2014 21:34:27 +0000 (14:34 -0700)]
Merge remote-tracking branch 'gh-nemobis/patch-1'

8 years ago Merge branch 'master' of
Benjamin Mako Hill [Fri, 18 Jul 2014 21:31:11 +0000 (14:31 -0700)]
Merge branch 'master' of

8 years ago Typofix
nemobis [Fri, 11 Jul 2014 22:36:17 +0000 (00:36 +0200)]

8 years ago Remove excess semicolon
nemobis [Fri, 11 Jul 2014 22:17:50 +0000 (00:17 +0200)]
Remove excess semicolon

$ make
g++ -O3    -c -o wikiq.o wikiq.cpp
wikiq.cpp: In function ‘void print_usage(char**)’:
wikiq.cpp:549:10: error: expected primary-expression before ‘<<’ token
make: *** [wikiq.o] Error 1
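
The log does not show the offending line. A stray semicolon in the middle of a chained stream insertion is one common cause of this diagnostic, so the following is only a hypothetical illustration: the function name and usage text are invented for the example and are not taken from wikiq.cpp.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a usage printer; not the real wikiq code.
std::string usage_text(const std::string& program) {
    std::ostringstream out;
    // Broken form: the first semicolon ends the statement early, so the next
    // line starts with "<<" and g++ reports
    // "expected primary-expression before '<<' token".
    //
    //   out << "usage: " << program << " [options]\n";
    //       << "  -h  print this help\n";
    //
    // Fixed form: one statement with a single terminating semicolon.
    out << "usage: " << program << " [options]\n"
        << "  -h  print this help\n";
    return out.str();
}
```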

8 years ago Packages suggestion for the lazy deb users
nemobis [Fri, 11 Jul 2014 22:14:03 +0000 (00:14 +0200)]
Packages suggestion for the lazy deb users

8 years ago updated README file and documentation
Benjamin Mako Hill [Sat, 12 Apr 2014 05:41:44 +0000 (22:41 -0700)]
updated README file and documentation

8 years ago remove shannon entropy as something computed
Benjamin Mako Hill [Sat, 12 Apr 2014 02:56:45 +0000 (19:56 -0700)]
remove shannon entropy as something computed

It's really just not something most people will use, and it's
somewhat expensive to compute.

8 years ago collapse datetime into a single header
Benjamin Mako Hill [Sat, 12 Apr 2014 02:53:18 +0000 (19:53 -0700)]
collapse datetime into a single header

It's already a single field.

8 years ago added code and docs to search for regex within the content of revisions
Benjamin Mako Hill [Fri, 11 Apr 2014 22:55:11 +0000 (15:55 -0700)]
added code and docs to search for regex within the content of revisions

8 years ago change diff regex options from -n and -r to -N and -R
Benjamin Mako Hill [Fri, 11 Apr 2014 22:29:40 +0000 (15:29 -0700)]
change diff regex options from -n and -r to -N and -R

8 years ago rename variables so regex/regexes is diff_regex/regex
Benjamin Mako Hill [Fri, 11 Apr 2014 22:12:43 +0000 (15:12 -0700)]
rename variables so regex/regexes is diff_regex/regex

All the variables used to check for patterns within diffs are named
regex, regexes, etc. This is problematic because (a) we already have one
more place we can check for regex and (b) I'm about to add code to look
for regex in arbitrary revisions (/not/ diffs). This change "makes room"
for the next one.

8 years ago renamed variables for title regex
Benjamin Mako Hill [Fri, 11 Apr 2014 22:05:36 +0000 (15:05 -0700)]
renamed variables for title regex

The variables used to be called wp_namespace_* which was confusing
because the software is not only useful for Wikipedia and is not only
useful for matching namespaces.

8 years ago ignore object files created in compilation
Benjamin Mako Hill [Fri, 11 Apr 2014 21:58:40 +0000 (14:58 -0700)]
ignore object files created in compilation

11 years ago Merge pull request #2 from makoshark/master
Erik Garrison [Mon, 6 Jun 2011 23:15:49 +0000 (16:15 -0700)]
Merge pull request #2 from makoshark/master

regex titles and static linking

11 years ago made changes necessary to build a static version
Benjamin Mako Hill [Mon, 2 May 2011 23:55:37 +0000 (19:55 -0400)]
made changes necessary to build a static version

Working with Bernie Innocenti, I completely overhauled the Makefile with
the goal of having it be able to build static versions of the library
which I need to run it on my cluster. In general, this has simplified
the Makefile. I had to make one small change to the disorder file in the

11 years ago added support for regex title matching
Benjamin Mako Hill [Mon, 2 May 2011 19:56:48 +0000 (15:56 -0400)]
added support for regex title matching

Added support for regex matching of titles to (for example) match only
things from particular namespaces. In the process, I changed the "t"
option which prints out text to "d" for debug and used the t option for
taking title regexes.
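
A minimal sketch of title filtering, assuming std::regex and a hypothetical helper named title_selected. The regex library and variable names that wikiq actually uses are not shown in this log. Treating an empty pattern list as "no filter" is also an assumption.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns true when the title matches at least one pattern.
// Assumption: an empty pattern list means "no filter", so every title passes.
bool title_selected(const std::string& title,
                    const std::vector<std::regex>& title_patterns) {
    if (title_patterns.empty()) return true;
    for (const std::regex& re : title_patterns) {
        // regex_search matches anywhere in the title; anchor with ^ to
        // restrict a pattern to a namespace prefix such as "Talk:".
        if (std::regex_search(title, re)) return true;
    }
    return false;
}
```

With patterns such as ^Talk: and ^User:, only pages in those namespaces would be processed.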

11 years ago Merge pull request #1 from makoshark/master.
Erik Garrison [Mon, 2 May 2011 18:06:05 +0000 (11:06 -0700)]
Merge pull request #1 from makoshark/master.

Two Quick Fixes from Mako

11 years ago lower case N in the usage instructions
Benjamin Mako Hill [Mon, 2 May 2011 18:01:09 +0000 (14:01 -0400)]
lower case N in the usage instructions

11 years ago fix bug and invoke regex search in first revision
Benjamin Mako Hill [Mon, 2 May 2011 17:59:02 +0000 (13:59 -0400)]
fix bug and invoke regex search in first revision

Regex search was only being called in diffs. Of course, if something is
added in the very first revision, it will never show up in a diff. This
was resulting in deletions that don't show up as additions. I've patched
the code so that the first revision is treated as one big addition.
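
The idea can be sketched as follows, assuming C++17. The line-set comparison is a naive stand-in for wikiq's real diff routine, and every name here (split_lines, added_lines, pattern_added) is hypothetical.

```cpp
#include <optional>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text into lines on '\n'.
static std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Naive stand-in for a diff: lines of `current` that occur nowhere in
// `previous`. With no previous revision, every line counts as an addition.
std::vector<std::string> added_lines(const std::optional<std::string>& previous,
                                     const std::string& current) {
    std::vector<std::string> cur = split_lines(current);
    if (!previous) return cur;  // first revision: one big addition
    std::vector<std::string> prev = split_lines(*previous);
    std::set<std::string> seen(prev.begin(), prev.end());
    std::vector<std::string> added;
    for (const std::string& line : cur) {
        if (seen.count(line) == 0) added.push_back(line);
    }
    return added;
}

// True when the pattern appears in text added by this revision.
bool pattern_added(const std::optional<std::string>& previous,
                   const std::string& current, const std::regex& re) {
    for (const std::string& line : added_lines(previous, current)) {
        if (std::regex_search(line, re)) return true;
    }
    return false;
}
```

Without the first-revision branch, a pattern introduced when the page was created would never register as added, yet its later removal would still register as a deletion.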

11 years ago dynamically resize article text buffer if we would overflow
Erik Garrison [Tue, 15 Mar 2011 03:03:04 +0000 (23:03 -0400)]
dynamically resize article text buffer if we would overflow
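
A minimal sketch of growing a text buffer on demand, assuming a doubling strategy with realloc. The struct, function name, and starting capacity are hypothetical; the commit does not say which growth policy wikiq uses.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical growable text buffer; not the real wikiq structure.
struct TextBuffer {
    char*       data = nullptr;
    std::size_t len = 0;  // bytes used, excluding the terminating NUL
    std::size_t cap = 0;  // bytes allocated
};

// Append n bytes from src, doubling the capacity until the data fits.
// Returns false (leaving the buffer unchanged) if allocation fails.
bool buffer_append(TextBuffer& buf, const char* src, std::size_t n) {
    std::size_t needed = buf.len + n + 1;  // +1 for the NUL terminator
    if (needed > buf.cap) {
        std::size_t new_cap = buf.cap ? buf.cap : 64;
        while (new_cap < needed) new_cap *= 2;
        char* grown = static_cast<char*>(std::realloc(buf.data, new_cap));
        if (!grown) return false;
        buf.data = grown;
        buf.cap = new_cap;
    }
    std::memcpy(buf.data + buf.len, src, n);
    buf.len += n;
    buf.data[buf.len] = '\0';
    return true;
}
```

The caller releases the storage with std::free(buf.data) when the buffer is no longer needed.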

11 years ago updated README
Erik Garrison [Mon, 14 Mar 2011 00:49:54 +0000 (20:49 -0400)]
updated README

11 years ago help text updates
Erik Garrison [Mon, 14 Mar 2011 00:37:24 +0000 (20:37 -0400)]
help text updates

11 years ago regex against article diffs and checksum-based revert detection
Erik Garrison [Mon, 14 Mar 2011 00:30:43 +0000 (20:30 -0400)]
regex against article diffs and checksum-based revert detection

11 years ago added (broken, but running) diff routines for block-level diffs
Erik Garrison [Sun, 13 Mar 2011 18:17:39 +0000 (14:17 -0400)]
added (broken, but running) diff routines for block-level diffs

12 years ago add md5sum, entropy
Erik Garrison [Fri, 28 Jan 2011 12:19:27 +0000 (07:19 -0500)]
add md5sum, entropy

12 years ago added shannon_H entropy metric for each revision
Erik Garrison [Sun, 24 Oct 2010 20:28:15 +0000 (16:28 -0400)]
added shannon_H entropy metric for each revision

12 years ago resolved performance bug in cdata handling of revision text
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text

Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended.  This is O(N^2)!  The problem was
most severe for vandalized articles with inlined HTML, but it ultimately
affected all data parsing as expat calls the charhndl function every
time it resolves a default XML entity.

By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead.  Now, we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump.  The current Simple English wiki dump takes about 3
minutes to process on my 2x2.4GHz laptop, even when handling all text
data.  Just decompressing it to /dev/null takes around 1 minute.
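
A minimal sketch of the length-tracking append described above. The struct and function names are hypothetical and this is not the actual revisionData code. It assumes a fixed-capacity, caller-owned buffer and truncates rather than overflowing.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical text field that remembers its own length.
struct Field {
    char*       text;      // caller-owned storage, always NUL-terminated
    std::size_t length;    // current length, excluding the NUL
    std::size_t capacity;  // total bytes available, including the NUL
};

// Append up to n bytes of chunk and return how many were copied.
// strcat/strncat would first scan from the start of text to find its end,
// costing O(length) per call; here the end is already known, so each call
// costs O(n) and the total over many small chunks stays linear.
std::size_t append_tracked(Field& f, const char* chunk, std::size_t n) {
    std::size_t room = f.capacity - 1 - f.length;
    if (n > room) n = room;  // truncate instead of overflowing
    std::memcpy(f.text + f.length, chunk, n);
    f.length += n;
    f.text[f.length] = '\0';
    return n;
}
```

A character-data callback that receives a pointer and a byte count could pass both straight through, since the chunk does not need to be NUL-terminated.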

12 years ago buffer storage of text data, structure cleanup
Erik Garrison [Sat, 23 Oct 2010 22:57:50 +0000 (18:57 -0400)]
buffer storage of text data, structure cleanup

Using a 2MB fixed-size buffer for the article text data speeds up
processing dramatically when text data is not thrown away.

12 years ago properly implemented text and comment omission flag
Erik Garrison [Mon, 11 Oct 2010 17:53:07 +0000 (13:53 -0400)]
properly implemented text and comment omission flag

12 years ago moved from gcc to g++
Erik Garrison [Thu, 7 Oct 2010 12:46:50 +0000 (08:46 -0400)]
moved from gcc to g++

12 years ago Revert "changed compilation flags to -O3"
Erik Garrison [Thu, 7 Oct 2010 04:55:57 +0000 (00:55 -0400)]
Revert "changed compilation flags to -O3"

This reverts commit 315071f4419c2e34fd367e09d24818807353c91e.

12 years ago changed compilation flags to -O3
Erik Garrison [Thu, 7 Oct 2010 04:52:10 +0000 (00:52 -0400)]
changed compilation flags to -O3

12 years ago parseData -> revisionData
Erik Garrison [Thu, 7 Oct 2010 04:50:04 +0000 (00:50 -0400)]
parseData -> revisionData

12 years ago first produced in the summer of 2006, finally published!
Erik Garrison [Thu, 7 Oct 2010 04:44:59 +0000 (00:44 -0400)]
first produced in the summer of 2006, finally published!
