contextual lexer accept set seems wrong #1581

@MartyMcFlyInTheSky

Trying to parse this AIREP:

UAAA01 EGRR 031514
VPFBQ 6600S 06834W 1320 F200 MS32 290/36
VPFBQ 6400S 06903W 1356 F200 MS30 280/33
VPFBQ 6200S 06928W 1433 F200 MS30 310/18
VPFBQ 6000S 06950W 1514 F200 MS30 00/38

with this grammar:

%import common.WS_INLINE
%import common.NEWLINE
%ignore WS_INLINE
%ignore NEWLINE


?start: airep_tac

// significant newline has higher prio than NEWLINE
_NL.2: /\n/

airep_tac: header_line designator_line? (airep_block | airep_block_snl)


// -------------------- WMO header --------------------
header_line: message_type issuing_office issue_time correction* _NL

message_type: TTAAII
issuing_office: CCCC
issue_time: YYGGGG
correction: BBB

TTAAII: /U[A-Z]{3}[0-9]{2}(?![A-Z0-9])/
CCCC: /[A-Z]{4}(?![A-Z])/ 
YYGGGG: /[0-9]{6}(?![0-9])/ 
BBB: /[A-Z]{3}/

// -------------------- designator line --------------------
designator_line: AIREP date? _NL

AIREP.2: "AIREP"

date: DDHH
DDHH: /\d{4}/

// -------------------- airep_blocks --------------------

// airep blocks using ARP/ARS as record separator
airep_block: airep_line+
airep_line: msg_type_designator airplane_id loc_ref REST+

REST: /(?!ARP|ARS)[^\s]+/

msg_type_designator: ARP | ARS
ARP: "ARP"
ARS: "ARS"

airplane_id: /[A-Z0-9]{4,7}/

loc_ref: latlon_ddmm

latlon_ddmm: LAT_DD LAT_MM LAT_HEM LON_DDD LON_MM LON_HEM
LAT_DD.5: /\d{2}(?=\d{2}[NS]\s*\d{5}[EW])/
LAT_MM.5: /\d{2}(?=[NS]\s*\d{5}[EW])/
LAT_HEM.5: /[NS]/
LON_DDD.5: /\d{3}(?=\d{2}[EW])/
LON_MM.5: /\d{2}(?=[EW])/
LON_HEM.5: /[EW]/

// airep blocks using significant newlines as record separator
airep_block_snl: airep_line_snl+
airep_line_snl: airplane_id loc_ref REST_SNL+ _NL

REST_SNL: /[^\n]+/

The lexer still has the REST terminal in its accept set, which I don't understand. Here is the concrete error:

E               lark.exceptions.UnexpectedToken: Unexpected token Token('REST', '1320') at line 2, column 20.
E               Expected one of:
E                       * REST_SNL
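
As a quick sanity check outside of Lark, the two regexes from the grammar can be tried at the reported position with plain `re`. This is only a sketch of the regex overlap. It says nothing about how Lark orders or filters terminals, and the variable names are mine, not part of the grammar.

```python
import re

# Terminal patterns copied verbatim from the grammar above.
REST = re.compile(r"(?!ARP|ARS)[^\s]+")
REST_SNL = re.compile(r"[^\n]+")

# Line 2 of the input; the error reports column 20 (1-based).
line = "VPFBQ 6600S 06834W 1320 F200 MS32 290/36"
pos = 20 - 1

rest_match = REST.match(line, pos)
rest_snl_match = REST_SNL.match(line, pos)

# Both terminals can match at this position, so the lexer has to pick one.
print(repr(rest_match.group()))      # '1320'
print(repr(rest_snl_match.group()))  # '1320 F200 MS32 290/36'
```

So at column 20 both REST and REST_SNL are textually possible, and the question is only which of them the lexer is allowed to emit in that parser state.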

This surprises me. If I understand the contextual lexer correctly, the parse table should not even consider REST here: REST should only be a valid choice inside airep_line derivations, not inside airep_line_snl derivations. I assumed the parse table would be built that way, but some logging with the interactive parser shows:

Parser choices:
        - REST_SNL -> (Reduce, Rule(NonTerminal(Token('RULE', 'latlon_ddmm')), [Terminal('LAT_DD'), Terminal('LAT_MM'), Terminal('LAT_HEM'), Terminal('LON_DDD'), Terminal('LON_MM'), Terminal('LON_HEM')], None, RuleOptions(False, False, None, None)))
        - REST -> (Reduce, Rule(NonTerminal(Token('RULE', 'latlon_ddmm')), [Terminal('LAT_DD'), Terminal('LAT_MM'), Terminal('LAT_HEM'), Terminal('LON_DDD'), Terminal('LON_MM'), Terminal('LON_HEM')], None, RuleOptions(False, False, None, None)))
stack size: 9
EXPECTED: ['REST_SNL']
NEXT: 1320
LAST_OK: W
F

After parsing LON_HEM, the parser still has both terminals available as lookaheads. Why?
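
One possible explanation, sketched here with a toy LR(0) construction and not with Lark's actual code: in an LALR automaton, states are identified by their LR(0) kernel, so the states inside a shared rule such as latlon_ddmm may be reached from both airep_line and airep_line_snl, and their lookaheads get merged. The reduced grammar below (`line_a`, `line_b`, `loc`) and the helpers `closure`, `goto`, `walk` and `direct_followers` are hypothetical stand-ins for illustration only.

```python
# Toy grammar: two line shapes that share the `loc` rule,
# mirroring airep_line (REST) and airep_line_snl (REST_SNL).
GRAMMAR = {
    "start": [("line_a",), ("line_b",)],
    "line_a": [("ARP", "ID", "loc", "REST")],
    "line_b": [("ID", "loc", "REST_SNL", "NL")],
    "loc": [("LAT", "LON")],
}

def closure(items):
    """LR(0) closure: add `X -> . body` for every nonterminal X after a dot."""
    items = set(items)
    todo = list(items)
    while todo:
        head, body, dot = todo.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for prod in GRAMMAR[body[dot]]:
                item = (body[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

def goto(state, symbol):
    """Advance the dot over `symbol` in every item that allows it."""
    moved = {(h, b, d + 1) for (h, b, d) in state if d < len(b) and b[d] == symbol}
    return closure(moved)

def walk(symbols):
    """Follow a sequence of grammar symbols from the initial state."""
    state = closure({("$root", ("start",), 0)})
    for symbol in symbols:
        state = goto(state, symbol)
    return state

def direct_followers(nonterminal):
    """Symbols that directly follow `nonterminal` in any rule body (simplified FOLLOW)."""
    return {
        body[i + 1]
        for bodies in GRAMMAR.values()
        for body in bodies
        for i, sym in enumerate(body[:-1])
        if sym == nonterminal
    }

state_a = walk(["ARP", "ID", "LAT", "LON"])  # inside line_a
state_b = walk(["ID", "LAT", "LON"])         # inside line_b

print(state_a == state_b)               # True: one shared state
print(sorted(direct_followers("loc")))  # ['REST', 'REST_SNL']
```

If Lark's LALR table behaves like this toy model, the state after LON_HEM is a single state whose reduce action is listed for both REST and REST_SNL, which would match the "Parser choices" output above.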
