Archive for September, 2008

Algorithmic puzzle

I was assigned the task of fixing a bug in the Scala standard library involving the indexOf, which given a receiver object that is a sequence and another sequence of the correct type, checks whether the latter is contained within the latter and returns the index. The current version does not correctly handle the case when a suffix of the receiver object matches a strict prefix of the argument (for example, List(1,2,3).indexOf(List(3,4)) will report a match at the index of 2). This should be fixed for the upcoming 2.7.2-RC2 release.

As soon as I started rewriting the code, I wondered why the original author hadn't just used an off the shelf searching algorithm. However, a quick search reminded me why: algorithms like Knuth-Morris-Pratt and Boyer-Moor construct a table based upon the sequence to search for. However, Scala sequences may be infinite so it is not possible to blindly go ahead and attempt to construct a table, because doing so may diverge.

Furthermore, there is no way to test whether a sequence is finite without potentially diverging. So it is not possible to first test whether the argument is finite, because if the receiver object is finite then indexOf should return that there is no match. Alternately, testing whether the receiver object is finite would be incorrect because it is possible the argument is finite an could potentially match.

However, it seems like it should still be possible to do better than O(nm), where n is the length of the receiver and m the length of the argument. For example if you start out with the sequence 1, 2, 3, 1 ... and the pattern 1, 3, 4 ... it seems like it should be possible to exploit the fact that you've looked ahead and know that there is no point and comparing 2 with 1. Alternately it seems like it might be possible to lazily build a table from the argument, but I would need to think longer to see whether it is always possible, in Knuth-Morris-Pratt for example, to fill in a complete prefix of the table without having processed the entire pattern.

In any event, searching with combinations of keywords like "string", "searching", "lazy", "infinite", etc. did not really turn anything up. One possible direction might be to look at "incremental" search algorithms like those used in text editors, etc. However, I expect that because they are geared to interactive use that the pattern will usually be quite small and therefore much thought has not been put into optimizing them.

Comments (6)