Skip to content

Wikia/wikidiff2

Repository files navigation

Wikidiff2 is partly based on the original wikidiff, partly on DifferenceEngine.php, and partly Tim Starling's own work. It performs word-level (space-delimited) diffs on general text, and character-level diffs on text composed of characters from the Japanese and Thai alphabets and the unified han. Japanese, Chinese and Thai do not use spaces to separate words. The input is assumed to be UTF-8 encoded. Invalid UTF-8 may cause undesirable operation, such as truncation of the output, so the input should be validated by the application. The input text should have unix-style line endings.

The output is an HTML fragment -- a number of HTML table rows with the rest of the document structure omitted. The characters "<", ">" and "&" will be HTML-escaped in the output.

A number of test files are provided:

* test-a1 and test-a2, expected output test-a.diff
* test-b1 and test-b2, expected output test-b.diff

These pairs together give 85.9% coverage of DiffEngine.h. The remaining untested part is mostly in _DiffEngine::_shift_boundaries(). 

* chinese-reverse-1.txt and chinese-reverse-2.txt (in chinese-reverse.zip)

These files are 2.3MB each, and give a worst-case performance test. Performance in the worst case is sensitive to the performance of the associative array class used to cross-reference the strings. I tried using an STL map and a Judy array. The Judy array gave an 11% improvement in execution time over the map, which could probably be increased to 15% with further optimisation work. I don't consider that to be a sufficient improvement to warrant adding a library dependency, but the code has been left in for the benefit of Judy fans and performance perfectionists. It can be enabled by compiling with -DUSE_JUDY. The C++ wrapper for JudyHS might be of use to someone.

Wikidiff2 is a PHP extension.

== Dependencies ==
It requires the following library:

* libthai, a Thai language support library
  http://linux.thai.net/plone/TLWG/libthai/
  On Debian-based systems, you need libthai0 and libthai-dev packages

* To build wikidiff2 as a HHVM extension on Debian systems, you need the following packages:
        hhvm-dev libtbb-dev libtbb2 libboost-all-dev libdouble-conversion-dev \
        libdouble-conversion1 libgoogle-glog-dev libgoogle-glog-doc libgoogle-glog0 \
        libjemalloc-dev libjemalloc1 libjemalloc1-dbg

* To build wikidiff2 as a PHP extension, you need the following packages:
** On Jessie and previous versions:
        php5-dev pkg-config
** On Stretch and later versions:
        php-dev pkg-config


== Compilation and installation with Zend PHP ==

$ phpize
$ ./configure
$ make
$ sudo make install

== Compilation and installation with HHVM ==

$ hphpize
$ cmake .
$ make
$ sudo make install

== LocalSettings configuration ==

ini_set("wikidiff2.moved_line_threshold", <float>);

Wikidiff2 estimates similarity of added and deleted lines based on changed character count. When the similarity of an added and deleted line is greater than this threshold, the lines are displayed as moved. 
Range 0.0 .. 1.0. Default 0.4.

ini_set("wikidiff2.change_threshold", <float>);

Changed lines with a similarity value below this threshold will be split into a deleted line and added line. This helps matching up moved lines in some cases.
Range 0.0 .. 1.0. Default 0.2.

$wgWikiDiff2MovedParagraphDetectionCutoff = <int>;

When the number of added and deleted lines in a diff is greater than this limit, no attempt to detect moved lines will be made.
Default 30.

vim: wrap