Programming contest postmortem

For those who haven't investigated this year's ICFP Programming contest, I'll provide a brief overview. We were given the task of saving Endo, an alien that crash landed on Earth and needed his »DNA« to be rewritten into a form that would allow him to survive the conditions on our planet. Endo, being a member of the Fuun species has a decidedly non-terrestrial biological basis.

Endo's »DNA« consists of a sequence of the bases I, C, F, and P. It has the most usual property that it acts both as an information storage mechanism and as catalyst for reactions. So perhaps the existence of the Fuun support the RNA world hypothesis, as we could say that their »DNA« does the same job as both our DNA and RNA. The catalytic process produces new sequences of »DNA« that provide the instructions for building a Fuun. We were given the information needed to express a Fuun's phenotype as 600 by 600 pixel images. Furthermore, we were provided with Endo's current »DNA« sequence, his current phenotype,

Endo’s source DNA

and a target phenotype

Target DNA for Endo

as suggested by his trusty and intelligent spacecraft Arrow. Our task was to synthesize a sequence that could be used as a prefix Endo's current genome, that would as closely »morph« Endo towards expressing this phenotype as possible.

Overall, I would say that this year's contest proved to be much more challenging than in previous years. However, I say this based primarily on the fact that (a) my team of two could not build a usable tool chain to start experimenting with Endo's genome within 48 hours and (b) the close of the closed the best team with public statistics had developed a sequence that would give Endo an 8% likelihood of survival. However, the overwhelming majority of the participants were well behind this. Not only was there the challenge of building a workable tool chain, but understanding Endo's genome well enough to reverse engineer it. I gather there were hints that became available once you had working tools, but I cannot really comment on their nature.

My team would have greatly benefited from additional minds to throw at the problem, but even so we did a few things sub-optimally. We initially divided the work at too high of a granularity. Brian start writing a DNA execution engine, while I built a tool for rendering the expressed phenotype. However, we should have concentrated more on writing smaller reusable components for the common structures to avoid duplication and make optimization easier.

The biggest problems we had were in debugging our DNA execution engine (or even figuring out whether it was buggy) and executing it efficiently enough that we could perform experiments. Brian's initial attempt was far too slow, though I can't remember exactly what structure he had chosen to use for representing sequences, other than it was partly based upon list. I then tried an approach that used an OCaml Bigarray to hold the maximal twenty-five megabyte working space for sequences during the simulation. However, this still proved to be too slow.

The next day, Brian tried some other approaches while I coded a new execution in OCaml from scratch, to have independent implementations for comparison and so that I could test some additional approaches to representing DNA sequences.Eventually I came to the conclusion that we taking a significant performance hit because of all the large copies we were performing. Knowing what sort of people the organizers were, I decided that perhaps the way they would have thought about the problem was to create a data type with constructors for each of the operations that the data needed to support and then begin working out algebraic laws for it. My initial version looked like

type seq = StringSeq of string
         | ConcatSeq of seq * seq
         | ConsSeq of base * seq
         | SnocSeq of seq * base
         | SliceSeq of seq * int * int

this proved promising, but eventually concluded that in practice ConsSeq was never really used. So I then switched to

type seq = StringSeq of string
         | ConcatSeq of seq * seq
         | SnocSeq of seq * base
         | SliceSeq of seq * int * int

By using this data type and taking advantage of the various algebraic properties I was able to come up with a implementations of concatenation, »snoc«, indexing, slicing, and computing the length of sequences that were reasonably efficient and worked without performing any copying (though I did introduce some rewrites that would copy a coalesce small StringSeq chunks). Incidentally, this turns out partly to have simply been reinventing the »rope« data structure that many participants found acceptably efficient.

However, the data structure did not account for the quotation operation on sequences. Attempting to quote instances of the seq data type in practice would require flattening the structure first, which would cause my implementation to die thrashing. Furthermore, I was unable to come up with a satisfactory method of treating quotation symbolically, because it does not commute with slicing.

In retrospect, and with an additional thirty-six hours behind me, I think I could have probably moved to a representation like

type seq = StringSliceSeq of string * int * int
         | ConcatSeq of seq * seq
         | SnocSeq of seq * base
         | QuoteSeq of seq

where QuoteSeq is lazily eliminated, and that would have probably resolved the problem. However, even then we would have had considerable challenges of ahead of us.

Still, I thought it was a really clever and interesting problem, and it was a refreshingly concrete problem to focus on after spending so much time on InforML and my dissertation.


  1. brian said,

    July 24, 2007 @ 7:07 am

    Yes, I blame my stupidity on inability to design proper data structures (and algorithms), something that I never quite mastered as an undergrad.

  2. washburn said,

    July 24, 2007 @ 7:09 am

    Well, I hope I didn’t make you sound stupid.

  3. brian said,

    July 24, 2007 @ 7:30 am

    You didn’t. I just think that it’s been far too long since I actually programmed a task where efficiency of representation and algorithms mattered in a significant way.

RSS feed for comments on this post · TrackBack URI

Leave a Comment