ANTLR 4.5 – Mismatched Input ‘x’ expecting ‘x’

This seems to be a common misunderstanding of ANTLR:

Language Processing in ANTLR:

The Language Processing is done in two strictly separated phases:

  • Lexing, i.e. partitioning the text into tokens
  • Parsing, i.e. building a parse tree from the tokens

Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.

Lexing

Lexing in ANTLR works as following:

  • all rules with uppercase first character are lexer rules
  • the lexer starts at the beginning and tries to find a rule that matches best to the current input
  • a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
  • tokens are generated from matches:
    • if one rule matches the maximum length match the corresponding token is pushed into the token stream
    • if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream

Example: What is wrong with your grammar

Your grammar has two rules that are critical:

FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|"https://stackoverflow.com/"|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;

Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.

There are two hints for that:

  • keep your lexer rules disjunct (no token should match a superset of another).
  • if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
  • if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).

Leave a Comment