Algorithmic puzzle

I was assigned the task of fixing a bug in the Scala standard library involving the indexOf, which given a receiver object that is a sequence and another sequence of the correct type, checks whether the latter is contained within the latter and returns the index. The current version does not correctly handle the case when a suffix of the receiver object matches a strict prefix of the argument (for example, List(1,2,3).indexOf(List(3,4)) will report a match at the index of 2). This should be fixed for the upcoming 2.7.2-RC2 release.

As soon as I started rewriting the code, I wondered why the original author hadn't just used an off the shelf searching algorithm. However, a quick search reminded me why: algorithms like Knuth-Morris-Pratt and Boyer-Moor construct a table based upon the sequence to search for. However, Scala sequences may be infinite so it is not possible to blindly go ahead and attempt to construct a table, because doing so may diverge.

Furthermore, there is no way to test whether a sequence is finite without potentially diverging. So it is not possible to first test whether the argument is finite, because if the receiver object is finite then indexOf should return that there is no match. Alternately, testing whether the receiver object is finite would be incorrect because it is possible the argument is finite an could potentially match.

However, it seems like it should still be possible to do better than O(nm), where n is the length of the receiver and m the length of the argument. For example if you start out with the sequence 1, 2, 3, 1 ... and the pattern 1, 3, 4 ... it seems like it should be possible to exploit the fact that you've looked ahead and know that there is no point and comparing 2 with 1. Alternately it seems like it might be possible to lazily build a table from the argument, but I would need to think longer to see whether it is always possible, in Knuth-Morris-Pratt for example, to fill in a complete prefix of the table without having processed the entire pattern.

In any event, searching with combinations of keywords like "string", "searching", "lazy", "infinite", etc. did not really turn anything up. One possible direction might be to look at "incremental" search algorithms like those used in text editors, etc. However, I expect that because they are geared to interactive use that the pattern will usually be quite small and therefore much thought has not been put into optimizing them.

6 Comments »

  1. Walter Chang said,

    September 8, 2008 @ 11:19 am

    i dont have an answer for you but do have one suggestion, though. please provide overridden versions so that indexOf on finite seq can be fast and on infinite seq correct. thanks!

  2. washburn said,

    September 8, 2008 @ 11:26 am

    @Walter: Sorry, but someone else will have to do that. Frankly, I shouldn’t even really be working on this bug at all, as I have a lot of work on Scala Classic to do and the deadlines are already too close.

  3. thom said,

    September 8, 2008 @ 8:46 pm

    i don’t know enough scala to give a proper answer. most java code that has to deal with a situation like this however would use instanceOf for example:

    int getSize(Iterable iterable) {
    if (iterable instanceof Collection) {
    return ((Collection) iterable).size();
    } else {
    // loop…
    }
    }

    can you distinguish between definitely finite and potentially infinite sequences in a similar way using instanceof in scala?

  4. washburn said,

    September 8, 2008 @ 8:55 pm

    @thom: No, we can’t. The Seq type already has a length method, it is just not guaranteed to terminate. Furthermore, we can’t use instanceOf to decide what to do because we can’t know ahead of time all possible classes that might mix in the Seq trait.

  5. Matthew said,

    September 13, 2008 @ 12:39 am

    This any use?

    http://haskell.org/pipermail/haskell-cafe/2007-August/029816.html

    http://twan.home.fmf.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell.details

  6. washburn said,

    September 15, 2008 @ 9:14 pm

    @Matthew: I haven’t had a chance to think through all of the code presented, but it does at least upon a naïve inspection seem to have the correct behavior and running time. So if nothing else it looks like a good start. Thanks!

RSS feed for comments on this post · TrackBack URI

Leave a Comment