Parser generators considered harmful?

Okay, I'm being a little sarcastic, but I was just thinking that while parser generators frequently make the lives of those of us who implement languages easier, are they really the best thing for »production« language implementations? Mostly I'm thinking about this in terms of error reporting to the user rather than in terms of performance. In particular, during a discussion in PLClub on Friday there was a question of whether the new breed of parser generators becoming available (elkhound, frown, ml-antlr, sugar, etc.), the ones that offer or implicitly provide arbitrary lookahead, can actually produce comprehensible error messages. At worst, you will be facing O(n^k) ways to resolve a rule (where n is the number of tokens in your language and k the lookahead). Maybe it doesn't turn out to be that bad in practice. Still, it seems like one could perhaps provide the most informative error messages by writing the parser by hand.

Of course, the argument against writing the parser by hand is quite similar to the argument against programming in assembly language, even though assembly gives you the ability to achieve maximal performance. Using a parser generator allows you to express the parsing process in a more abstract way, which means you get all the nice things like easier maintenance and the ability to actually perform analysis of your grammar. I guess the issue is that perhaps some more thought should be put into giving developers a way to »profile« their grammars (»is there anywhere where the lookahead gets unmanageable?«, etc.) and more sophisticated methods for customizing the error reporting process (for example, ml-yacc only reports errors in terms of its tokens, which are not necessarily meaningful to the user).

Anyway, I'm about to give Aaron's new lexing and parsing tools for SML/NJ a try. I'm not sure whether I will make the leap from ml-yacc to ml-antlr just yet, but I'll probably give ml-ulex a try, just because ml-lex doesn't allow comments in the rule section of the file.

I should spend some time reading over the Dragon Book tomorrow just to remind myself about the trade-offs with traditional parsing techniques. There actually turns out to be a second edition of the Dragon Book that was just released, but I didn't find the first edition useful enough to pay 100 USD for the new one yet. I admittedly remember very little of what was covered concerning parsing algorithms in my undergraduate compilers course.

2 Comments »

  1. aaron said,

    November 5, 2006 @ 4:00 pm

    Just a couple of remarks:

    I think most k>1 parsers use a more restricted notion of lookahead. For ml-antlr, lookahead is not a k-tuple of tokens, but rather a single token that may appear up to k tokens ahead. Well, that’s roughly the story, anyway — in reality things are a bit more complicated. But, at any rate, this technique gives you something like size O(n*k) for your lookahead decision. In practice, this works quite well, and you usually need only k = 3 or so. We provide selective backtracking when you need more power.
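    A rough sketch of that idea (in Python, with invented token names; this is only my reading of the scheme, not ml-antlr's actual implementation): instead of storing every possible k-tuple of tokens, each alternative keeps one token set per lookahead depth, so the decision table grows as n*k rather than n^k.

```python
# Hypothetical illustration of collapsing full LL(k) lookahead tuples
# into one token set per depth (the O(n*k) representation described
# above). The grammar alternatives and token names are made up.

K = 2

# Suppose two alternatives can each be followed by these 2-token sequences:
alt_a = {("if", "("), ("if", "id")}
alt_b = {("while", "("), ("while", "id")}

def approx(seqs, k):
    """Collapse a set of k-tuples into one token set per depth."""
    return [{s[d] for s in seqs} for d in range(k)]

def decide(lookahead, table):
    """Pick the first alternative whose per-depth sets all match."""
    for name, sets in table.items():
        if all(lookahead[d] in sets[d] for d in range(len(sets))):
            return name
    return None  # no alternative matches: a syntax error would be reported here

table = {"A": approx(alt_a, K), "B": approx(alt_b, K)}
print(decide(("while", "id"), table))
```

    The approximation can in general accept token sequences that no real alternative produces (it cannot distinguish interleavings of the per-depth sets), which is presumably part of why the real story is, as Aaron says, a bit more complicated.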

    More broadly, depending on your error handling technique, I don’t think the lookahead strategy has much of a bearing. For ml-antlr, we use Burke-Fisher error repair, which is the same strategy used in ml-yacc. Basically, when an error is detected, the parser “backs up” about, say, 20 tokens, and tries EVERY single token change possible in those 20 tokens: insertions, deletions, and substitutions. For each attempt, it sees how much farther the parser gets, and chooses the best correction according to some heuristic. Although this sounds very expensive, it really isn’t very bad, and the penalty is only incurred when there is an error and you’re willing to spend time analyzing it anyway. This technique is essentially parser-agnostic: all it’s doing is using the parser to try various permutations of the input.
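    The repair loop is simple enough to sketch. Below is a toy version (in Python; the window size, the token vocabulary, and the »how far did the parser get« scoring are all simplifications of what ml-yacc and ml-antlr actually do):

```python
# A minimal sketch of Burke-Fisher single-token error repair as described
# above. The toy "parser" just recognizes a NUM (PLUS NUM)* grammar and
# reports how many tokens it consumed before getting stuck.

WINDOW = 20  # how far to back up from the detected error

def parse_progress(tokens):
    """Return how many tokens a NUM (PLUS NUM)* parser consumes."""
    consumed = 0
    expect_num = True
    for tok in tokens:
        if expect_num and tok == "NUM":
            expect_num = False
        elif not expect_num and tok == "PLUS":
            expect_num = True
        else:
            break
        consumed += 1
    return consumed

def repair(tokens, error_pos, vocab=("NUM", "PLUS")):
    """Try every insertion, deletion, and substitution in the window
    before error_pos; return the candidate that parses farthest."""
    start = max(0, error_pos - WINDOW)
    best = (parse_progress(tokens), list(tokens), "no repair")
    for i in range(start, min(error_pos + 1, len(tokens) + 1)):
        if i < len(tokens):  # deletion at position i
            cand = tokens[:i] + tokens[i + 1:]
            best = max(best, (parse_progress(cand), cand, f"delete at {i}"))
        for t in vocab:
            # insertion of t at position i
            cand = tokens[:i] + [t] + tokens[i:]
            best = max(best, (parse_progress(cand), cand, f"insert {t} at {i}"))
            if i < len(tokens):  # substitution of t at position i
                cand = tokens[:i] + [t] + tokens[i + 1:]
                best = max(best, (parse_progress(cand), cand, f"subst {t} at {i}"))
    return best

tokens = ["NUM", "PLUS", "PLUS", "NUM"]  # error detected at the second PLUS
score, fixed, action = repair(tokens, error_pos=2)
print(action, fixed)
```

    Because the search only runs once an error has been detected, the cost of trying every single-token edit in the window is paid only on the error path, exactly as described above.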

    A yet broader point. One thing that I think is tremendously hard to do with a handwritten scheme is “global”, or at least nonlocal, error recovery. The technique I just described backs up ~20 tokens because it is often the case that the “real” error is some distance behind the point at which it was detected.

    Finally, ml-antlr is in some respect a meta-language for writing recursive-descent parsers, which is what you’d be writing by hand anyway. The code it generates is, as a result, actually fairly readable. Furthermore, this approach makes it possible to get the best of both worlds: you can add support in the ml-antlr meta-language for custom error handling, and very easily include that in the generated code. This is, in fact, exactly what Parr’s antlr tool does. We don’t yet support this in ml-antlr, but it is definitely planned.
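    For a sense of what readable recursive-descent output looks like, here is a hand-style sketch (in Python rather than SML, and not actual ml-antlr output) of the shape such code takes: one function per nonterminal, with the lookahead test made explicit.

```python
# Hypothetical sketch of generated recursive-descent parsing code for the
# toy grammar expr ::= NUM (PLUS NUM)*. The structure, not the grammar,
# is the point: each nonterminal becomes a function.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, found {self.peek()} at {self.pos}")
        self.pos += 1

    def expr(self):
        """expr ::= NUM (PLUS NUM)* ; returns the number of operands."""
        self.eat("NUM")
        n = 1
        while self.peek() == "PLUS":  # single-token lookahead decision
            self.eat("PLUS")
            self.eat("NUM")
            n += 1
        return n

print(Parser(["NUM", "PLUS", "NUM"]).expr())
```

    Since the generated code has this shape anyway, custom error-handling actions can be spliced into `eat` and the lookahead branches, which is the hook that makes the best-of-both-worlds approach plausible.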

    So, I think parser generators should be considered helpful — but, obviously, I’m rather biased in that respect. 🙂

  2. kitby said,

    November 5, 2006 @ 7:11 pm

    There might also be a question of how difficult your language is for actual people to parse.
