How does similar_text work?

This was actually a very interesting question, thank you for giving me a puzzle that turned out to be very rewarding.

Let me start out by explaining how similar_text actually works.


Similar Text: The Algorithm

It’s a recursion based divide and conquer algorithm. It works by first finding the longest common string between the two inputs and breaking the problem into subsets around that string.

The examples you have used in your question, actually all perform only one iteration of the algorithm. The only ones not using one iteration and the ones giving different results are from the php.net comments.

Here is a simple example to understand the main issue behind simple_text and hopefully give some insight into how it works.


Similar Text: The Flaw

eeeefaaaaafddddd
ddddgaaaaagbeeee

Iteration 1:
Max    = 5
String = aaaaa
Left : eeeef and ddddg
Right: fddddd and geeeee

I hope the flaw is already apparent. It will only check directly to the left and to the right of the longest matched string in both input strings. This example

$s1='eeeefaaaaafddddd';
$s2='ddddgaaaaagbeeee';

echo similar_text($s1, $s2).'|'.similar_text($s2, $s1);
// outputs 5|5, this is due to Iteration 2 of the algorithm
// it will fail to find a matching string in both left and right subsets

To be honest, I’m uncertain how this case should be treated. It can be seen that only 2 characters are different in the string.
But both eeee and dddd are on opposite ends of the two strings, uncertain what NLP enthusiasts or other literary experts have to say about this specific situation.


Similar Text: Inconsistent results on argument swapping

The different results you were experiencing based on input order was due to the way the alogirthm actually behaves (as mentioned above).
I’ll give a final explination on what’s going on.

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

On the first case, there’s only one Iteration:

test
wert

Iteration 1:
Max    = 1
String = t
Left :  and wer
Right: est and 

We only have one iteration because empty/null strings return 0 on recursion. So this ends the algorithm and we have our result: 1

On the second case, however, we are faced with multiple Iterations:

wert
test

Iteration 1:
Max    = 1
String = e
Left : w and t
Right: rt and st

We already have a common string of length 1. The algorithm on the left subset will end in 0 matches, but on the right:

rt
st

Iteration 1:
Max    = 1
String = t
Left : r and s
Right:  and 

This will lead to our new and final result: 2

I thank you for this very informative question and the opportunity to dabble in C++ again.


Similar Text: JavaScript Edition

The short answer is: The javascript code is not implementing the correct algorithm

sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));

Obviously it should be first.substr(0,pos1)

Note: The JavaScript code has been fixed by eis in a previous commit. Thanks @eis

Demystified!

Leave a Comment