Making up with ANTLR

May 29, 2009

I like ANTLR! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all Dust Puppy like. And I have used it in the past with great success.

But every time I put this particular tool aside, I know that picking it back up will be like making up after a bad breakup. Things feel familiar, but you are still so uncomfortable that you cannot get anything working. Only knowing how great the tool is underneath makes me go through the effort of re-familiarization.

I just downloaded ANTLR 3.1.2 bundled with its own GUI, ANTLRWorks, which offers visual diagrams, a debugger and templates. You would think that would make for an easy out-of-the-box experience. You would be wrong.

You start the GUI and end up facing a blank screen. Lots of options and tabs for sure, but the only easy starting point seems to be ‘Insert rule from template’.

Ok, so here are a couple of rules from templates, trying to parse the “Hello World!” string:

ID : LETTER (LETTER | DIGIT)*
;

LETTER
: 'a'..'z' | 'A'..'Z'
;

DIGIT : '0'..'9'
;

WS : (' ' | '\t' | '\n' | '\r') { $setType(Token.SKIP); }
;

Not good. We are missing a start rule apparently. Ok, let’s add one:

hello : ID ID '!'
;

Still no good. Start looking at examples, trying to see what bits are compulsory. Ok, the word grammar is missing at the top of the file. Of course, I have both grammar and lexer elements now in one file (ANTLR 3 feature, I believe), but let’s not worry about deep meaning here.

grammar test;

Now, suddenly, the syntax diagram starts showing up. Let’s try saving (as test.g) and compiling. No good:

The following token definitions can never be matched because prior tokens match the same input: LETTER

So much for following a template. More digging in examples. Memory really starts to bring back the Dragon Book’s lessons. What’s the problem with LETTER, and who is the prior token here? Ah, we don’t want the lexer to return LETTER (or DIGIT), only ID. So LETTER and DIGIT are both token fragments, not tokens. Add fragment in front of both definitions. All good?
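For the record, the shadowing behind that warning can be modelled in a few lines of Python. This is a simplified sketch, not ANTLR's actual prediction algorithm: the RULES table and the tokenize function are made up for illustration, and the model simply takes the longest match and breaks ties in favour of the rule listed first. It also shows why the warning names only LETTER: an ID can never start with a digit, so DIGIT is not shadowed.

```python
import re

# Hypothetical rule table in grammar order (illustration only, not ANTLR's API).
RULES = [
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("LETTER", re.compile(r"[A-Za-z]")),
    ("DIGIT", re.compile(r"[0-9]")),
]


def tokenize(text):
    """Longest match wins; on a tie the earlier rule keeps the token."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' means a later rule must match MORE input to win.
            if match and (best_match is None or match.end() > best_match.end()):
                best_name, best_match = name, match
        if best_match is None:
            raise ValueError(f"no rule matches {text[pos]!r} at position {pos}")
        tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens


print(tokenize("a"))   # [('ID', 'a')]  LETTER matches too, but ID came first
print(tokenize("x1"))  # [('ID', 'x1')]
print(tokenize("7"))   # [('DIGIT', '7')]  ID cannot start with a digit
```

In this model, marking LETTER and DIGIT as fragment corresponds to dropping them from the rule table while still using them inside the ID pattern.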

Nope! Now we have a problem with:

attribute is not a token, parameter, or return value: setType

But I did not write setType, the template provided it! Back to the examples! Apparently, somewhere along the way the setType(Token.SKIP) idiom has gone away and we now have hidden channels instead. Swap that bit with one from an example and try again.
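As far as I understand the hidden channel idea, the whitespace tokens are still produced by the lexer; they are just tagged so that the parser never sees them. Here is a small Python sketch of that model. Everything in it (the Token class, the channel numbers, the lex and parser_view functions) is invented for illustration and is not the ANTLR runtime API.

```python
import re
from dataclasses import dataclass

DEFAULT_CHANNEL = 0
HIDDEN = 1  # illustrative value, not necessarily what ANTLR uses


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT_CHANNEL


# One alternative per lexer rule; WS matches a single whitespace character.
TOKEN_RE = re.compile(r"(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<WS>[ \t\n\r])|(?P<BANG>!)")


def lex(text):
    """Emit every token, tagging whitespace with the hidden channel."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = match.lastgroup
        channel = HIDDEN if kind == "WS" else DEFAULT_CHANNEL
        tokens.append(Token(kind, match.group(), channel))
        pos = match.end()
    return tokens


def parser_view(tokens):
    """What the parser gets to see: default-channel tokens only."""
    return [t for t in tokens if t.channel == DEFAULT_CHANNEL]


all_tokens = lex("Hello World!")
print([t.type for t in all_tokens])               # ['ID', 'WS', 'ID', 'BANG']
print([t.type for t in parser_view(all_tokens)])  # ['ID', 'ID', 'BANG']
```

The practical difference from throwing whitespace away is that the original text can still be rebuilt from the full token list, which is handy for tools that need to preserve formatting.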

SUCCESS. Switch to the interpreter, enter “Hello World!” in the input box and run the hello rule. Beauty, we have a parse diagram.

The final running grammar example is here:

grammar test;

hello : ID ID '!'
;

ID : LETTER (LETTER | DIGIT)*
;

fragment LETTER
: 'a'..'z' | 'A'..'Z'
;

fragment DIGIT : '0'..'9'
;

WS : (' ' | '\t' | '\n' | '\r') { $channel = HIDDEN; }
;
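To sanity-check what this grammar should and should not accept without firing up ANTLRWorks, here is a hand-written Python stand-in. It is only a sketch under my own assumptions: visible_tokens and matches_hello are made-up names, whitespace is simply dropped rather than kept on a hidden channel, and, unlike the hello rule (which has no EOF), it insists that the whole input matches.

```python
import re

# The three things the lexer can see: ID, the '!' literal, and WS.
TOKEN_RE = re.compile(r"(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<BANG>!)|(?P<WS>[ \t\n\r])")


def visible_tokens(text):
    """Return the token types the parser would see, with whitespace left out."""
    types = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":
            types.append(match.lastgroup)
        pos = match.end()
    return types


def matches_hello(text):
    """Stand-in for: hello : ID ID '!' ;  (whole input must match here)."""
    try:
        return visible_tokens(text) == ["ID", "ID", "BANG"]
    except ValueError:
        return False


print(matches_hello("Hello World!"))   # True
print(matches_hello("Hello\tWorld !")) # True, whitespace is invisible to the rule
print(matches_hello("HelloWorld!"))    # False, only one ID without a separator
print(matches_hello("Hello World"))    # False, the '!' is missing
```

Note that whitespace is hidden from the rule but still matters to the lexer: without it, "HelloWorld" is a single ID.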

Hello World! Now, on to the real grammar and (if things really, really work) GATE integration...