projects.mako.cc - wikiq/log
wikiq
13 years ago
Erik Garrison [Mon, 14 Mar 2011 00:30:43 +0000 (20:30 -0400)]
regex against article diffs and checksum-based revert detection
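
One way checksum-based revert detection can work is sketched below: remember a checksum for every revision of an article, and treat a new revision as a revert when its checksum matches an earlier one. This is only an illustration under stated assumptions, not the code from this commit. `RevisionHistory`, `history_add`, and `MAX_HISTORY` are hypothetical names, and FNV-1a is swapped in for a real digest such as MD5 to keep the example short.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HISTORY 1024  /* hypothetical cap on checksums kept per article */

/* Checksums of the revisions seen so far for one article (hypothetical). */
typedef struct {
    uint64_t sums[MAX_HISTORY];
    size_t   count;
} RevisionHistory;

/* FNV-1a, 64-bit: a simple stand-in for a real checksum such as MD5. */
uint64_t fnv1a64(const char *text, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)text[i];
        h *= 0x100000001b3ULL;           /* FNV prime */
    }
    return h;
}

/* Record a revision and report whether it restores earlier content.
 * Returns the index of the earliest revision with the same checksum,
 * or -1 if no earlier revision matches. */
long history_add(RevisionHistory *hist, const char *text, size_t len)
{
    assert(hist != NULL && text != NULL);
    uint64_t sum = fnv1a64(text, len);
    long match = -1;

    for (size_t i = 0; i < hist->count; i++) {
        if (hist->sums[i] == sum) {
            match = (long)i;
            break;
        }
    }
    /* Once the cap is reached, later checksums are simply not stored. */
    if (hist->count < MAX_HISTORY)
        hist->sums[hist->count++] = sum;
    return match;
}
```

With this sketch, adding "v1", then a vandalized text, then "v1" again returns -1, -1, and 0. Note that an edit that changes nothing also matches its predecessor, so a real tool may want to treat that case separately.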

13 years ago
Erik Garrison [Sun, 13 Mar 2011 18:17:39 +0000 (14:17 -0400)]
added (broken, but running) diff routines for block-level diffs

13 years ago
Erik Garrison [Fri, 28 Jan 2011 12:19:27 +0000 (07:19 -0500)]
add md5sum, entropy

14 years ago
Erik Garrison [Sun, 24 Oct 2010 20:28:15 +0000 (16:28 -0400)]
added shannon_H entropy metric for each revision
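
For reference, Shannon entropy is H = -sum(p_i * log2(p_i)) over the symbol frequencies p_i. The sketch below computes it over byte frequencies, in bits per byte. Both the choice of bytes as symbols and the base-2 logarithm are assumptions, since the commit message does not say how shannon_H is defined. `shannon_entropy` is a hypothetical name.

```cpp
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: Shannon entropy of buf[0..len), in bits per byte. */
double shannon_entropy(const unsigned char *buf, size_t len)
{
    size_t counts[256] = {0};
    double h = 0.0;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0.0;  /* define the entropy of empty text as zero */

    /* Histogram of byte values. */
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;

    /* H = -sum p * log2(p), skipping byte values that never occur. */
    for (int b = 0; b < 256; b++) {
        if (counts[b] == 0)
            continue;
        double p = (double)counts[b] / (double)len;
        h -= p * log2(p);
    }
    return h;
}
```

Text made of a single repeated byte scores 0, and text that uses four distinct bytes equally often scores 2 bits per byte.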

14 years ago
Erik Garrison [Sun, 24 Oct 2010 20:10:01 +0000 (16:10 -0400)]
resolved performance bug in cdata handling of revision text

Previously, when appending character data to long text fields, each call
to charhndl required a call to strlen on the text field to which the new
character data was to be appended.  Since strlen is linear in the field
length, building up a field of N bytes could cost O(N^2) overall.  The
problem was most severe for vandalized articles with inlined HTML, but
it ultimately affected all data parsing, as expat calls the charhndl
function every time it resolves a default XML entity.

By tracking the length of each field in our revisionData structure and
using a custom strncat-type function, it's possible to avoid this
overhead.  Now, we are pipe-buffer-bound when processing a 7z-compressed
MediaWiki dump.  The current Simple English Wikipedia dump takes about 3
minutes to process on my 2x2.4GHz laptop, even when handling all text
data.  Just decompressing it to /dev/null takes around 1 minute.
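
The idea can be sketched as follows: store the current length next to the buffer, so each append copies only the new bytes instead of first scanning the whole field with strlen. This is a minimal illustration, not the actual wikiq code. `TextField`, `field_init`, and `field_append` are hypothetical names, and the real revisionData layout may differ.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one text field of revisionData: the buffer
 * plus its current length, so appends never call strlen on it. */
typedef struct {
    char  *data;      /* caller-provided storage, always NUL-terminated */
    size_t len;       /* bytes in use, excluding the terminator */
    size_t capacity;  /* total size of data, including the terminator */
} TextField;

/* Point the field at caller-provided storage and make it an empty string. */
void field_init(TextField *f, char *storage, size_t capacity)
{
    assert(f != NULL && storage != NULL && capacity > 0);
    f->data = storage;
    f->len = 0;
    f->capacity = capacity;
    f->data[0] = '\0';
}

/* Append n bytes of src in O(n) time, truncating if the buffer is full.
 * Returns the number of bytes actually copied. */
size_t field_append(TextField *f, const char *src, size_t n)
{
    assert(f != NULL && src != NULL);
    size_t room = f->capacity - f->len - 1;  /* keep one byte for '\0' */
    if (n > room)
        n = room;
    memcpy(f->data + f->len, src, n);
    f->len += n;
    f->data[f->len] = '\0';
    return n;
}
```

A character-data callback would call `field_append` with the pointer and length that expat passes in. Resetting `len` to 0 (and writing a terminator at `data[0]`) between revisions reuses the same storage.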

14 years ago
Erik Garrison [Sat, 23 Oct 2010 22:57:50 +0000 (18:57 -0400)]
buffer storage of text data, structure cleanup

Using a 2MB fixed-size buffer for the article text data speeds up
processing dramatically when text data is not thrown away.

14 years ago
Erik Garrison [Mon, 11 Oct 2010 17:53:07 +0000 (13:53 -0400)]
properly implemented text and comment omission flag

14 years ago
Erik Garrison [Thu, 7 Oct 2010 12:46:50 +0000 (08:46 -0400)]
moved from gcc to g++

14 years ago
Erik Garrison [Thu, 7 Oct 2010 04:55:57 +0000 (00:55 -0400)]
Revert "changed compilation flags to -O3"

This reverts commit 315071f4419c2e34fd367e09d24818807353c91e.

14 years ago
Erik Garrison [Thu, 7 Oct 2010 04:52:10 +0000 (00:52 -0400)]
changed compilation flags to -O3

14 years ago
Erik Garrison [Thu, 7 Oct 2010 04:50:04 +0000 (00:50 -0400)]
parseData -> revisionData

14 years ago
Erik Garrison [Thu, 7 Oct 2010 04:44:59 +0000 (00:44 -0400)]
first produced in the summer of 2006, finally published!

Benjamin Mako Hill || Want to submit a patch?