
ANTLR lexer rule consumes too much

Question:

ANTLR Lexer Rule Design

I have a requirement for the following token:

<ul><li>Allowed characters: uppercase letters, lowercase letters, digits, spaces, and hyphens</li>
<li>Variable length, with a minimum of two characters</li>
<li>The token must contain at least one space or hyphen</li>
<li>The token must start and end with an uppercase letter, lowercase letter, digit, or hyphen (it cannot begin or end with a space)</li></ul>
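To make the intended token boundary concrete, here is a minimal sketch of the four requirements as a Python regular expression. This is not ANTLR code: the names `SHAPE` and `match_phrase` are made up for illustration, and the regex engine backtracks, which the ANTLR lexer does not. It only shows which text the token is expected to cover for the two inputs discussed below.

```python
import re

# A word character is an uppercase letter, lowercase letter, digit, or hyphen.
WORD = r"[A-Za-z0-9-]"

# Shape of the token: starts and ends with a word character, with word
# characters or spaces in between. This already forces a length of at least two.
SHAPE = re.compile(rf"{WORD}[A-Za-z0-9 -]*{WORD}")


def match_phrase(text, pos=0):
    """Return the longest valid token starting at pos, or None."""
    m = SHAPE.match(text, pos)
    # The token must also contain at least one space or hyphen.
    if m and re.search(r"[ -]", m.group()):
        return m.group()
    return None


if __name__ == "__main__":
    print(match_phrase("WATER TRANSPORTATION[4400]"))
    print(match_phrase("WATER TRANSPORTATION [4400]"))
```

Because the regex engine can give back the trailing space, both inputs yield the same token text. The lexer rule in the grammar has to reach that result without backtracking.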

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"

The following input fails to parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"

The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space after "WATER TRANSPORTATION" and runs into the left square bracket before the lexer realizes that there is no match; by then it has gone too far.

I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.

grammar T;

sic: SICSpecifier AlphaNumericSpaceHyphen LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET : '[';
RIGHTBRACKET : ']';
SICSpecifier: 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ { $channel = HIDDEN; };

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT : '0'..'9';

Digits: DIGIT+;

AlphaNumericSpaceHyphen
  : (UCASEALPHA|LCASEALPHA |DIGIT|'-')+ (' ' (UCASEALPHA|LCASEALPHA |DIGIT|'-')+)+
  | (UCASEALPHA|LCASEALPHA |DIGIT)+ ('-')+ ((' '|UCASEALPHA|LCASEALPHA |DIGIT|'-')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
  | ('-')+ (UCASEALPHA|LCASEALPHA |DIGIT)+ ((UCASEALPHA|LCASEALPHA |DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
  ;

Answer1:

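Unfortunately, lexer rules do not backtrack: once the rule has consumed the space, it cannot give it back when the next character turns out not to fit. Take a look at this related question: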

<a href="https://stackoverflow.com/questions/10136324/antlr-lexer-rule-consumes-characters-even-if-not-matched" rel="nofollow">ANTLR lexer rule consumes characters even if not matched?</a>

You can try adapting your grammar so that the rule changes the type of the token, as suggested in the solution there.
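The difference between committing to a character and checking ahead first can be sketched in plain Python. This is only a model of the behavior, not ANTLR code and not the solution from the linked question: the function names are hypothetical, and both scanners ignore the "at least one space or hyphen" requirement to keep the contrast visible. In an ANTLR 3 grammar, the same check-before-consume idea is usually expressed with a syntactic predicate, `(...)=>`, in front of the space.

```python
def is_word_char(ch):
    # Uppercase letter, lowercase letter, digit, or hyphen (ASCII only).
    return ch == "-" or (ch.isascii() and ch.isalnum())


def scan_words(text, pos):
    # Consume a run of word characters and return the new position.
    while pos < len(text) and is_word_char(text[pos]):
        pos += 1
    return pos


def scan_committed(text, pos=0):
    # Models a rule that commits to a space as soon as it sees one.
    end = scan_words(text, pos)
    if end == pos:
        return None
    while end < len(text) and text[end] == " ":
        after = scan_words(text, end + 1)
        if after == end + 1:
            return None  # space consumed but no word followed: no way back
        end = after
    return text[pos:end]


def scan_with_lookahead(text, pos=0):
    # Takes the space only when a word character is known to follow it.
    end = scan_words(text, pos)
    if end == pos:
        return None
    while (end + 1 < len(text) and text[end] == " "
           and is_word_char(text[end + 1])):
        end = scan_words(text, end + 1)
    return text[pos:end]


if __name__ == "__main__":
    failing = "WATER TRANSPORTATION [4400]"
    print(scan_committed(failing))
    print(scan_with_lookahead(failing))
```

The committed scanner fails on the input with a space before the bracket, just like the rule in the question, while the lookahead version stops before the trailing space and leaves it for the `WS` rule.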

Hope this helps.
