From: Erik Garrison
Date: Sun, 24 Oct 2010 20:10:01 +0000 (-0400)
Subject: resolved performance bug in cdata handling of revision text
X-Git-Url: https://projects.mako.cc/source/wikiq/commitdiff_plain/86aeece4b3a7ffa495b4c09441cffeec8e579f34?hp=86aeece4b3a7ffa495b4c09441cffeec8e579f34

resolved performance bug in cdata handling of revision text

Previously, when appending character data to long text fields, each call to charhndl required a call to strlen on the text field to which the new character data was to be appended. This is O(N^2)! The problem was most severe for vandalized articles with inlined HTML, but it ultimately affected all data parsing, as expat calls the charhndl function every time it resolves a default XML entity.

By tracking the length of each field in our revisionData structure and using a custom strncat-type function, it's possible to avoid this overhead. Now we are pipe-buffer-bound when processing a 7z-compressed MediaWiki dump. The current Simple English wiki dump takes about 3 minutes to process on my 2x2.4 GHz laptop, even when handling all text data; just decompressing it to /dev/null takes around 1 minute.

---