Interesting problem

Is it possible to compute a regular expression over byte sequences to match against specific ranges of Unicode glyphs encoded using UTF-8? This comes up when you're stuck using lexer generators that only understand byte sequences.

I have to assume that it must be possible, because at worst you can just union together the UTF-8 byte sequences for every glyph in the range. However, given how large some ranges can be, this would probably be very inefficient. The better question is whether or not there is a smarter way of constructing your regular-expression or DFA. Though right now I'm not inclined to think about it too deeply as I know if it really comes down to doing this, I can just do it the blindingly stupid way and write some DFA minimization code.


  1. kitby said,

    October 17, 2006 @ 7:15 pm

    Is this page at all helpful?

  2. washburn said,

    October 17, 2006 @ 7:18 pm

    A little. It gives a regular expression for matching valid UTF-8 sequences. But it will take some additional thinking to pare that down to specific glyph ranges.

  3. aaron said,

    October 17, 2006 @ 9:46 pm

    We thought about this for ml-ulex, but decided it was much simpler to put a hand-written UTF-8 -> Int decoder in front of the lexer, and then have the lexer (internally) work over words rather than bytes…

  4. washburn said,

    October 17, 2006 @ 9:54 pm

    Ah, I had been meaning to send you some e-mail about this. I just grabbed the latest version of SML/NJ this afternoon and I didn’t see anything about Unicode lexing. Though right now, I’m only going to add a couple tokens to start with (I’ve run out of delimiters…) so I can probably just hardcode the UTF-8 for those.

  5. aaron said,

    October 17, 2006 @ 10:33 pm

    I know that it’s been *forever*, but in the meantime we’ve written a completely new, LL(k) parser generator. Actually this week we are planning to put out 110.60, which will include the new tools. I will send you mail when this happens.

  6. washburn said,

    October 17, 2006 @ 10:56 pm

    Sounds great. Have you experimented with converting the SML grammar from LALR(1) to LL(k) yet? If so that would be a invaluable reference for me to work from.

  7. aaron said,

    October 17, 2006 @ 11:09 pm

    That’s next on my list, as soon as I finish the draft manual. We’ve had good success building a replacement ckit parser. I am hoping to provide an analog to ckit for SML, which would be pretty exciting given that the new parser supports extensible grammars. Anyway, I should have a chance to play with an SML grammar before the release, so I’ll send you whatever I end up with.

RSS feed for comments on this post · TrackBack URI

Leave a Comment