diff --git a/README.md b/README.md index b5a2f08..05d3a3d 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,51 @@ # @jsonic/c A [Jsonic](https://jsonic.senecajs.org) plugin that parses **C source code** -into a **concrete syntax tree** — preserving every token, comment, macro -definition, macro use, and compiler extension as-is. - -Targets **C23** plus the common **GCC / Clang / MSVC** extensions, with -best-effort handling of preprocessor conditional groups. +into a **concrete syntax tree** — preserving every token and comment as-is. + +Targets **plain C23**: keywords, punctuators, literals, declarations / +definitions / statements / expressions, C23 attributes (`[[…]]`), typedef +tracking. + +Opt in with `{extended: true}` for: + +- **Top-level preprocessor**: `#define`, `#undef`, `#include`, + `#pragma` / `#error` / `#warning` / `#line`, and the `#if` / + `#elif` / `#else` / `#endif` family. Each `#`-line is its own + typed CST node. `#if` … `#endif` runs are folded into a single + `conditional_group` with one `conditional_branch` per + `#if`/`#ifdef`/`#ifndef`/`#elif`/`#else` directive plus a closing + `endif` slot — nested `#if … #endif` inside a branch recursively + produces a nested `conditional_group`. Stray `#endif` / + unterminated `#if` degrade gracefully (group with no `endif` + field; orphan directives stay flat). `#define` populates a macro + table that the lexer consults to tag later identifiers as + `MACRO_NAME`; `#undef` reverts. +- **GCC `__attribute__((…))`** and **MSVC `__declspec(…)`** at + declaration scope: leading position before storage / type heads, + interleaved among specifiers, and **post-declarator** (after the + declarator's name / array postfix / function postfix, before the + `=` initializer or `,` / `;` terminator). All produce an + `attribute_spec` node with `attributeForm: 'gcc'|'msvc'` carrying + typed `attribute_item` children. 
The post-declarator slot also + accepts the C23 `[[…]]` form (`int z [[deprecated]];`) without + needing the `extended` flag. +- **GCC inline assembly** (`__asm__` / `asm` / `__asm`) at top level + and inside function bodies. Produces an `asm_statement` node with + optional qualifier tokens (`volatile`, `goto`, `inline`), + `asm_template`, and four section slots + (`asm_outputs` / `asm_inputs` / `asm_clobbers` / `asm_labels`) + containing typed item nodes (`asm_operand`, `asm_clobber`, + `asm_label_ref`). +- **In-body preprocessor lines**: `#`-lines that appear inside + function bodies (rare but legal — e.g. mid-body `#pragma`, + `#ifdef`, `#error`) are captured as a `preprocessor_line` node + containing the flat token sequence up to and including the + trailing `PP_NEWLINE`. The lexer requires `#` to start a logical + line. + +All listed extension shapes are now shipped under +`{extended: true}`. ## Quick start @@ -30,24 +70,20 @@ positions are preserved on token spans). ## Architecture - **Focused lex matchers** (`src/matchers.ts`): one matcher per concept — - whitespace, line continuation, line/block comments, preprocessor - directive opener (line-start gated), directive newline, header name, - identifier (with keyword/typedef-name/macro-name reclassification), - integer/float/char/string literals, and longest-match punctuator - dispatch. - -- **Symbol & macro tables** (`src/symbols.ts`): scope stack and macro - lookup live on `ctx.meta.cmeta` so both lex matchers and rule - actions share state. Lex matchers consult the tables when - classifying identifiers; rule actions register names when they - finalize a `typedef` or `#define`. Pre-lexed lookahead tokens are - reclassified in place so the very next match sees the updated - classification immediately. - -- **Token catalog** (`src/tokens.ts`): every C23 keyword, every - compiler-extension keyword, and every punctuator gets its own named - token. 
Grammar rules and structuring code reference these names - directly. + whitespace, line continuation, line/block comments, identifier (with + keyword/typedef-name reclassification), integer/float/char/string + literals, and longest-match punctuator dispatch. + +- **Symbol table** (`src/symbols.ts`): scope stack lives on + `ctx.meta.cmeta` so both lex matchers and rule actions share state. + Lex matchers consult the table when classifying identifiers; rule + actions register names when they finalise a `typedef`. Pre-lexed + lookahead tokens are reclassified in place so the very next match + sees the updated classification immediately. + +- **Token catalog** (`src/tokens.ts`): every C23 keyword and every + punctuator gets its own named token. Grammar rules and structuring + code reference these names directly. - **Declarative grammar** (`c-grammar.jsonic`): the rule shapes for the entire C surface — translation unit, external declarations, @@ -59,50 +95,20 @@ positions are preserved on token spans). - **Pratt-style expressions** via [`@jsonic/expr`](https://www.npmjs.com/package/@jsonic/expr): the `val` rule absorbs C atoms (`LIT_INT` / `LIT_FLOAT` / `LIT_CHAR` - / `LIT_STRING` / `ID` / `MACRO_NAME` / `TYPEDEF_NAME` / `KW_NULLPTR` - / `KW_TRUE` / `KW_FALSE`), then `@jsonic/expr`'s pratt logic - drives infix / prefix / suffix operator precedence. Custom val - open-alts handle the C-only constructs that aren't simple - operators: `sizeof ( type )` / cast / compound literal / `_Generic` - / GCC statement-expression / brace initializer list / adjacent - string concatenation. - -- **Conditional-group folding** (`src/conditional-groups.ts`): a - translation-unit-level post-pass that collapses contiguous runs - of `#if`/`#ifdef` … `#elif`/`#else` … `#endif` into a single - `conditional_group` node. Self-contained — operates only on - already-parsed `conditional_directive` nodes. 
- -- **Hybrid dispatch + legacy fallback** (`src/structure.ts`, - `src/expr.ts`): the `external_declaration` cascading wildcard - alts dispatch to `simple_declaration` (or to typed - preprocessor / asm / static_assert sub-rules) whenever - `@looks-simple-decl` recognises the head; otherwise the chomp - loop falls through to a recursive-descent post-processor in - `structure.ts`. Shapes covered by the new path: - - simple declarations (storage prefix, multi-keyword type, - pointer / array, function declarator, function definition) - - tagged-type specifiers (struct / union / enum, including - standalone definitions and C23 fixed-underlying-type enums) - - attribute specs (GCC / MSVC / C23, leading + between-specs - insertion points) - - top-level preprocessor directives (#define, #include, #if - family, #pragma / #error / #warning / #undef / #line) - - top-level GCC `__asm__` - - all expression and statement forms - - Shapes still on the legacy path: - - K&R parameter lists (`int f(a, b) int a; long b; { … }`) — - rare in modern code; csmith never generates them - - complex compound declarators beyond simple function pointers - (`int (*arr[N])(int);` arrays-of-fn-ptrs, - `int (*(*fpp))(int);` ptr-to-fn-ptr). Plain function pointers - `int (*fp)(int);` and top-level `static_assert(cond, msg);` - moved onto the grammar path in 2.0. - - Both paths produce identical CST shapes; the - `@jsonic/expr`-driven `val` handles initializer expressions in - either case. + / `LIT_STRING` / `ID` / `TYPEDEF_NAME` / `KW_NULLPTR` / `KW_TRUE` / + `KW_FALSE`), then `@jsonic/expr`'s pratt logic drives infix / + prefix / suffix operator precedence. Custom val open-alts handle + the C constructs that aren't simple operators: `sizeof ( type )` / + cast / compound literal / `_Generic` / brace initialiser list / + adjacent string concatenation. 
+ +The parser is grammar-only: every external declaration flows through +`simple_declaration` (or `static_assert_declaration` for top-level +`static_assert`, or — with `{extended: true}` — `preprocessor_directive` +for `#`-lines). There is no chomp fallback and no post-process +structurer. The dispatch alts at the top of `external_declaration` +accept exactly the heads the grammar can parse, and anything else is a +parse error. ## Concrete-syntax shapes @@ -111,9 +117,6 @@ fields. Highlights: ``` translation_unit - conditional_group (#if … #elif … #else … #endif folded) - branches: conditional_branch { branchKind, directive, body } - endif external_declaration { declKind: 'declaration'|'function_definition' } declaration_specifiers attribute_spec, struct_specifier, union_specifier, enum_specifier @@ -129,19 +132,25 @@ translation_unit parameter_type_list { variadic? } parameter_declaration { declaredName } identifier_list (K&R) - asm_label?, attribute_spec? '=' initializer + kr_declaration_list (K&R fn-def: flat token-refs + between the param `)` and + the body `{`) static_assert_declaration { condition, message? } define_directive { macroName, macroKind, macroParams?, macroVariadic? } - include_directive { includeForm, headerKind, headerName } - conditional_directive { directive } - pragma_directive / error_directive / warning_directive / undef_directive + undef_directive { macroName } + include_directive { headerKind, headerName } + conditional_group (#if … #elif … #else … #endif folded) + branches: conditional_branch { branchKind, directive } [] + endif: conditional_directive { directive: 'endif' } + conditional_directive { directive } (orphan / unterminated only) + pragma_directive / error_directive / warning_directive / line_directive compound_statement declaration | statement if_statement, switch_statement, while_statement, do_statement, for_statement (for_controls), labeled_statement { labelKind, labelName? 
}, jump_statement { jumpKind }, - expression_statement, asm_statement, preprocessor_line + expression_statement ``` ### Expression shapes (Pratt-parsed via @jsonic/expr) @@ -159,7 +168,7 @@ the per-kind CST shapes below. literal_expression { literalKind, value } identifier_expression { name } paren_expression -call_expression { callee, isMacro } +call_expression { callee } argument_list subscript_expression { target, index_list } member_expression { object, op ('.'|'->'), memberName } @@ -173,7 +182,6 @@ comma_expression generic_selection generic_controlling_expression { expression } generic_association { associationKind, typeName?, value } -statement_expression // GCC ({ ... }) compound_literal { typeName, initializer_list } initializer_list initializer_item { designation?, value } @@ -188,58 +196,45 @@ C's classic ambiguity (an identifier may name a typedef OR a variable) is resolved at lex time. The identifier matcher consults `SymbolTable.isTypedef(word)` and emits **TYPEDEF_NAME** instead of **ID** for every typedef'd name. After a `typedef int T;` declaration -finalizes, the symbol table is updated AND any pre-fetched lookahead +finalises, the symbol table is updated AND any pre-fetched lookahead tokens carrying that name are reclassified in place, so the next declaration sees the new classification regardless of jsonic's arbitrary-lookahead. -A parallel **macro table** records `#define`d names. Identifiers seen -earlier in a `#define` lex as **MACRO_NAME**, and `call_expression` -nodes carrying such a callee get `isMacro: true` so consumers can -distinguish a macro invocation from a real function call without -re-querying any table. `#undef` removes the entry. - Full **nested scoping** (file / function-prototype / function-body / block / struct-or-union / enum / for-init) is implemented in `SymbolTable`. Inner non-typedef bindings shadow outer typedefs. -## Preprocessor - -Each `#-line` is its own structured directive node (see shapes above). 
-A translation-unit-level post-pass folds the flat sequence of -`#if`/`#ifdef`/`#ifndef` … (`#elif`…)\* (`#else`)? … `#endif` into a -single `conditional_group` containing typed branches. Best-effort: -unmatched `#endif` or unterminated `#if` leaves the surrounding -sequence flat. Nested `#if … #endif` inside a branch is recursively -grouped. - -`#define` directives populate `ctx.meta.cmeta.macros`; `#undef` -removes. The macro table is the single source of truth used by lex-time -**MACRO_NAME** tagging. - -## Attributes (all three forms structured) +## Compound declarators ``` -attribute_spec { attributeForm: 'gcc'|'msvc'|'c23', items } - attribute_item { attributeName, attributePrefix?, argumentList? } - attribute_argument_list // Pratt-parsed args +int (*fp)(int); // pointer to function +int (**fp)(int); // pointer to pointer to function +int (*p)[10]; // pointer to array +int (*arr[3])(int); // array of fn-pointers +int (*get())[10]; // fn returning ptr-to-array +int (*(*fpp))(int); // nested paren-form +char *(*foo[3])(int); // leading-pointer type with paren-form ``` -`__attribute__((noreturn, format(printf, 1, 2)))`, -`__declspec(dllexport)`, and C23 `[[gnu::pure]]` / -`[[deprecated("reason")]]` all produce the same item shape. +`paren_inner_declarator` recurses for nested paren-forms and +dispatches `array_postfix` / `function_postfix` for inner postfixes +that bind to the inner direct_declarator. `init_declarator`'s close +accepts a paren-form after a leading pointer prefix so the +leading-pointer-type shape (`char *(*foo)(int);`) flows through +without falling into a function-postfix interpretation. -## GCC inline assembly +## K&R function definitions ``` -asm_statement { qualifiers } - asm_template { expression } - asm_outputs? asm_operand { asmName?, constraint, value { expression } } - asm_inputs? - asm_clobbers? asm_clobber { value } - asm_labels? 
asm_label_ref { labelName } +int f(a, b) int a; long b; { return a + b; } ``` +The identifier-list parameter list is captured by `function_postfix`'s +`identifier_list` alt; the parameter-type declarations between `)` +and `{` are absorbed by `kr_declaration_list` as a flat token-ref +sequence (no inner declaration structuring). + ## for-loop controls ``` @@ -249,87 +244,6 @@ for_controls for_iter { value: | empty } ``` -## Coverage and known limitations - -The parser handles every shape in the CSmith-generated regression -corpus (100 random C programs) plus a hand-curated stress sweep -(GCC `__attribute__`, C23 `nullptr` / `[[nodiscard]]` / `_BitInt`, -nested preprocessor `#if` chains, line-continuation in macro -bodies, function pointers, GCC inline assembly with operand -sections, struct bitfields with anonymous unions, designated and -indexed initialisers). - -Known fall-throughs that produce a `declKind: 'unknown'` external -declaration rather than a structured one (still parseable, source -fidelity preserved): - -- K&R-style parameter declarations (`int f(a, b) int a; long b; { … }`). -- GCC `__extern_inline` declarations gated on a `__USE_EXTERN_INLINES` - feature macro that hasn't been `#define`d. - -The first parse of `(struct point){ … }` (compound literal with a -struct-tagged type) inside a function body is not yet structured — -the struct-tagged type isn't in the new path's `SIMPLE_TYPE_HEAD` -set. Top-level brace initialisers on struct types (`struct point p -= { … };`) work because they go through the legacy fallback. - -## Architecture history - -The parser shipped through a 14-phase migration from a pure -chomp-and-post-process design to the current near-pure-grammar -hybrid: - -- **A** install `@jsonic/expr`; `val` accepts C atoms with the - evaluate callback emitting the public CST shapes. 
-- **B** `simple_declaration` family + statement family —
-  `block_item` / `statement` / `expression_statement` /
-  `jump_statement` / `if`/`while`/`do`/`switch`/`for` /
-  `labeled_statement` / `asm_statement` / `preprocessor_line`.
-- **C** `val` open-alts for type-name constructs:
-  `type_name` / `sizeof_type_form` / `cast_or_compound_literal` /
-  `initializer_list` (with `designation` / `designator`) /
-  `generic_selection` / `statement_expression` / `string_atom` /
-  structured `asm_statement`.
-- **D** cutover gates: deep-lookahead body validation
-  (`fetchDeep()` drives `ctx.lex` directly so the body-supportedness
-  check walks past the closing `}` of any function body), all
-  unit tests passing on the new path, csmith fixtures regenerated.
-  Shipped as `0.2.0`.
-- **F** struct / union / enum specifiers + members + bitfields +
-  enumerators, dispatched from `simple_declaration` / `spec_loop`.
-- **G** attribute specs (3 forms × leading + between-specs
-  insertion points).
-- **H** top-level preprocessor directives — define / undef /
-  include / conditional / pragma / error / warning / line — with
-  macro registration on `cmeta.macros`, header-name lex-mode
-  feedback, and the typed sub-rules wrapped under
-  `external_declaration`.
-- **I** top-level GCC `__asm__`. (`static_assert` grammar rule
-  defined; top-level dispatch deferred pending comma-op gating.)
-- **K** `structureConditionalGroups` extracted to its own
-  module — a self-contained translation-unit-level post-pass.
-- **L** standalone struct / enum definitions through grammar
-  (`@looks-simple-decl` walks past tagged-type bodies).
-- **N** ship `1.0.0`.
-- **P** parenthesised sub-declarators (function pointers):
-  `paren_inner_declarator` rule + `@looks-simple-decl` paren-walk
-  branch. Shapes like `int (*fp)(int);` and
-  `typedef int (*Fn)(int);` flow through the grammar.
-- **O** vendor `@jsonic/expr` under `vendor/jsonic-expr/` and - add a `n.no_comma_op` bail in `val.close` / `expr.close` that - matches the comma op by src. Top-level `static_assert(cond, msg)` - dispatches into the existing `static_assert_declaration` rule - with the flag set, so the `,` lands as a separator instead of - the comma operator. -- **N₂** ship `2.0.0` declaring the hybrid as the final - architecture. - -The legacy chomp + `structureExternalDeclaration` fallback -remains by design for the long-tail shapes — K&R parameter lists -and complex compound declarators beyond simple function pointers. -Both paths emit identical CST nodes, so consumers see one tree -regardless of which path produced it. - ## License MIT. Copyright (c) 2026 Richard Rodger and contributors. diff --git a/c-grammar.jsonic b/c-grammar.jsonic index fe7acb4..29ddc13 100644 --- a/c-grammar.jsonic +++ b/c-grammar.jsonic @@ -42,6 +42,17 @@ # before deciding to recurse. extdecl_loop: { open: [ + # [extension: preprocessor] When the next token is a + # `#if`/`#ifdef`/`#ifndef` directive, fold the run up to + # the matching `#endif` into a conditional_group. Stray + # `#elif`/`#else`/`#endif` outside any group fall through + # to the regular external_declaration path and stay flat. + # The 2-token lookahead `s:` forces both `#` and the + # directive-name token to be fetched; b: 2 backsteps so + # conditional_group's open re-takes them via + # conditional_directive. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cg-head-is-if-family' b: 2 + p: 'conditional_group' g: 'loop-cg' } { p: 'external_declaration' g: 'loop-one' } ] close: [ @@ -50,96 +61,144 @@ ] } + # ---- conditional_group ------------------------------------------ + # + # Folds a run of `#if`/`#ifdef`/`#ifndef` … (`#elif`/`#elifdef`/ + # `#elifndef`/`#else`)? * `#endif` into a single conditional_group + # node. 
Each branch is a conditional_branch carrying its directive + # and body (a sequence of external_declarations and nested + # conditional_groups). Pure grammar — no post-pass. + # + # State on rule.k: + # curBranch — current conditional_branch under construction + # branchOpen — true after a directive opens a branch (waiting + # for its body to absorb) + # bodyTaken — true after cg_branch_body returned for current + # branch (waiting for next directive) + # endifTaken — true once #endif has been consumed; rule exits + conditional_group: { + open: [ + # Already initialised on r:-recursion: skip and run close. + { c: '@cg-reentered' s: [] g: 'cg-reentry' } + # Take the leading #if-family directive as the first branch's + # opener. @conditional_group-bc detects the returning child + # and opens a new conditional_branch. The 2-token `s:` with + # b: 2 force-fetches both tokens so the cond can peek at the + # directive name; conditional_directive then re-takes them. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cg-head-is-if-family' b: 2 + p: 'conditional_directive' g: 'cg-open-first' } + ] + close: [ + # Group cleanly closed: rule exits. + { c: '@cg-completed' s: [] g: 'cg-end' } + # Just opened a branch — absorb its body until the next + # boundary directive. + { c: '@cg-need-body' p: 'cg_branch_body' g: 'cg-body' } + # Body absorbed; next is #elif / #elifdef / #elifndef / #else. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cg-next-is-elif-or-else' b: 2 + p: 'conditional_directive' r: 'conditional_group' + g: 'cg-advance' } + # Body absorbed; next is #endif. Take it and complete. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cg-next-is-endif' b: 2 + p: 'conditional_directive' g: 'cg-take-endif' } + # Malformed: EOF before #endif, or stray non-directive token. + # Bail — exit with a partial group; the (potentially + # incomplete) node is still pushed onto extdecl_loop.children. 
+ { s: '#ZZ' b: 1 a: '@cg-bail' g: 'cg-bail-eof' } + { s: [] a: '@cg-bail' g: 'cg-bail-fall' } + ] + } + + # ---- cg_branch_body --------------------------------------------- + # + # Absorbs external_declarations (and nested conditional_groups) + # into the parent conditional_group's curBranch.children. Stops + # without consuming when the next token is a boundary directive + # (#elif / #elifdef / #elifndef / #else / #endif) — the parent + # rule's close picks it up. + cg_branch_body: { + open: [ + # Empty body or boundary at start: exit immediately. The 2- + # token `s: PP_HASH #ANY_C_TOKEN` force-fetches the lookahead + # so the cond can peek at the directive name; b: 2 backsteps + # both so the parent rule sees them. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cgb-at-boundary' b: 2 + g: 'cgb-empty' } + { s: '#ZZ' b: 1 g: 'cgb-empty-eof' } + # Nested #if-family → recurse into a fresh conditional_group. + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cgb-at-nested-if' b: 2 + p: 'conditional_group' g: 'cgb-first-nested' } + # First body item. + { p: 'external_declaration' g: 'cgb-first' } + ] + close: [ + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cgb-at-boundary' b: 2 + g: 'cgb-end' } + { s: '#ZZ' b: 1 g: 'cgb-eof' } + { s: 'PP_HASH #ANY_C_TOKEN' c: '@cgb-at-nested-if' b: 2 + p: 'conditional_group' r: 'cg_branch_body' g: 'cgb-nested' } + { p: 'external_declaration' r: 'cg_branch_body' g: 'cgb-more' } + ] + } + # external_declaration # - # Phase B1 dispatch: if the head token is a recognised simple type - # specifier (currently only KW_INT, broadens later), descend into - # int_declaration which parses through proper grammar (with val - # for initializers via @jsonic/expr). Otherwise fall through to - # the legacy chomp path that absorbs tokens for post-process - # structuring. 
+ # The head token's class picks the dispatch: + # - KW_STATIC_ASSERT / KW__STATIC_ASSERT → static_assert_declaration + # - KW__BITINT, #STORAGE_PREFIX, #SIMPLE_TYPE_HEAD, `[[` → simple_declaration + # Anything else fails — plain C has no chomp fallback. external_declaration: { open: [ { s: '#ZZ' b: 1 g: 'extdecl-eof' } - # [extension: preprocessor] PP_HASH dispatches to preprocessor_directive. + # [extension: preprocessor] PP_HASH dispatches to the typed + # preprocessor_directive sub-rule, which routes by directive + # name. Two-token lookahead force-fetches both `#` and the + # directive-name token. Gated on @ext-and-first-iter so the + # alts are inert in plain mode (and the underlying rules are + # also stripped from the spec when extended is false). { s: 'PP_HASH PP_HASH' c: '@ext-and-first-iter' b: 2 - p: 'preprocessor_directive' a: '@mark-new-path' - g: 'extdecl-pp-2' } + p: 'preprocessor_directive' g: 'extdecl-pp-2' } { s: 'PP_HASH #ANY_C_TOKEN' c: '@ext-and-first-iter' b: 2 - p: 'preprocessor_directive' a: '@mark-new-path' - g: 'extdecl-pp' } - # Phase O: top-level static_assert dispatches into the + p: 'preprocessor_directive' g: 'extdecl-pp' } + # Top-level static_assert dispatches into the # static_assert_declaration grammar rule. The cond / msg # vals are pushed with n.no_comma_op set so the vendored # @jsonic/expr's expr.close bails at `,` rather than # treating it as the comma operator. { s: 'KW_STATIC_ASSERT' c: '@is-first-iter' b: 1 - p: 'static_assert_declaration' a: '@mark-new-path' - g: 'extdecl-sa' } + p: 'static_assert_declaration' g: 'extdecl-sa' } { s: 'KW__STATIC_ASSERT' c: '@is-first-iter' b: 1 - p: 'static_assert_declaration' a: '@mark-new-path' - g: 'extdecl-sa-1' } + p: 'static_assert_declaration' g: 'extdecl-sa-1' } # [extension: gcc-asm] top-level inline assembly block. 
{ s: 'KW_ASM' c: '@ext-and-first-iter' b: 1 - p: 'asm_statement' a: '@mark-new-path' - g: 'extdecl-asm' } + p: 'asm_statement' g: 'extdecl-asm' } { s: 'KW___ASM' c: '@ext-and-first-iter' b: 1 - p: 'asm_statement' a: '@mark-new-path' - g: 'extdecl-asm-1' } + p: 'asm_statement' g: 'extdecl-asm-1' } { s: 'KW___ASM__' c: '@ext-and-first-iter' b: 1 - p: 'asm_statement' a: '@mark-new-path' - g: 'extdecl-asm-2' } - # Plain-mode direct dispatches. When the head token clearly - # starts a declaration, push simple_declaration without a - # lookahead validator — the rule's own alts and per-rule k - # state disambiguate the actual shape. These run only when - # `extended: false` so we don't bypass the wildcard alts' - # @looks-simple-decl + isFunctionBodySupported gate that - # routes asm-body / pp-line function definitions to the - # legacy structuring path (those constructs only matter in - # extended mode anyway). - { s: '#SIMPLE_TYPE_HEAD' c: '@plain-and-first-iter' b: 1 - p: 'simple_declaration' a: '@mark-new-path' - g: 'extdecl-plain-type' } - { s: '#STORAGE_PREFIX' c: '@plain-and-first-iter' b: 1 - p: 'simple_declaration' a: '@mark-new-path' - g: 'extdecl-plain-storage' } - { s: 'KW__BITINT' c: '@plain-and-first-iter' b: 1 - p: 'simple_declaration' a: '@mark-new-path' - g: 'extdecl-plain-bitint' } - { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@plain-as23-and-first' - b: 2 p: 'simple_declaration' a: '@mark-new-path' - g: 'extdecl-plain-c23-attr' } - # Phase B2.3 dispatch: cascading wildcard-token alts. Each one - # matches a fixed number of tokens to force lookahead, then the - # @looks-simple-decl cond validates the actual shape — optional - # storage prefix, 1+ simple type specifiers, an ID, and a `;` or - # `=` terminator. b: N back-steps all matched tokens so - # simple_declaration sees them as t0..t(N-1). - # Longest alts first so multi-keyword forms win over shorter - # shapes that would have stopped at the wrong ID. 
- # Gate: only on the first iteration of an external_declaration - # so the chomp's r:-recursion doesn't re-fire mid-declaration. - { s: '#ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN' - c: '@looks-simple-decl' b: 6 - p: 'simple_declaration' a: '@mark-new-path' g: 'extdecl-new-decl-6' } - { s: '#ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN' - c: '@looks-simple-decl' b: 5 - p: 'simple_declaration' a: '@mark-new-path' g: 'extdecl-new-decl-5' } - { s: '#ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN' - c: '@looks-simple-decl' b: 4 - p: 'simple_declaration' a: '@mark-new-path' g: 'extdecl-new-decl-4' } - { s: '#ANY_C_TOKEN #ANY_C_TOKEN #ANY_C_TOKEN' - c: '@looks-simple-decl' b: 3 - p: 'simple_declaration' a: '@mark-new-path' g: 'extdecl-new-decl-3' } - { s: '#ANY_C_TOKEN' a: '@absorb-token' g: 'extdecl-tok' } - ] - close: [ - { c: '@new-path' a: '@finalize-new-path' g: 'extdecl-new-end' } - { s: '#ZZ' b: 1 a: '@finalize-extdecl' g: 'extdecl-finish-eof' } - { c: '@just-closed-and-decl-ahead' a: '@finalize-extdecl' g: 'extdecl-finish-block' } - { c: '@terminated' a: '@finalize-extdecl' g: 'extdecl-finish' } - { r: 'external_declaration' g: 'extdecl-more' } + p: 'asm_statement' g: 'extdecl-asm-2' } + # Direct dispatches. When the head token clearly starts a + # declaration, descend into simple_declaration; the rule's own + # alts and per-rule k state disambiguate the actual shape. + { s: '#SIMPLE_TYPE_HEAD' c: '@is-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-type' } + { s: '#STORAGE_PREFIX' c: '@is-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-storage' } + # [extension: gcc-attr] leading GCC __attribute__((…)) dispatches + # into simple_declaration so its open's leading-attr alt picks it up. 
+ { s: 'KW___ATTRIBUTE__' c: '@ext-and-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-attr-gcc' } + { s: 'KW___ATTRIBUTE' c: '@ext-and-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-attr-gcc-1' } + # [extension: msvc-attr] leading __declspec(…) + { s: 'KW___DECLSPEC' c: '@ext-and-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-attr-msvc' } + { s: 'KW__BITINT' c: '@is-first-iter' b: 1 + p: 'simple_declaration' g: 'extdecl-bitint' } + { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@as23-and-first' + b: 2 p: 'simple_declaration' g: 'extdecl-c23-attr' } + ] + close: [ + { a: '@finalize-new-path' g: 'extdecl-end' } ] } @@ -162,7 +221,7 @@ # has wired up for full C precedence). simple_declaration: { open: [ - # Leading C23 attribute spec — plain C23. + # Leading C23 attribute spec. { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@as23-adjacent-open' b: 2 p: 'spec_loop' g: 'simple-decl-attr-c23' } # [extension: gcc-attr] leading GCC __attribute__((…)) spec. @@ -205,6 +264,18 @@ { s: 'PUNC_LBRACE' b: 1 p: 'compound_statement' a: '@simple-decl-start-fn-body' g: 'simple-decl-fn-body' } + # K&R-style function definition: between the `)` of an + # identifier-list parameter list and the `{` of the body, + # a sequence of parameter declarations may appear (`int + # f(a, b) int a; long b; { … }`). Descend into + # kr_declaration_list which absorbs flat token-refs until + # `{`; on return the LBRACE alt above picks up the body. + # Gated on @kr-not-yet so we don't re-fire after the list + # has already been attached. + { s: '#SIMPLE_TYPE_HEAD' c: '@kr-not-yet' b: 1 + p: 'kr_declaration_list' g: 'simple-decl-kr-type' } + { s: '#STORAGE_PREFIX' c: '@kr-not-yet' b: 1 + p: 'kr_declaration_list' g: 'simple-decl-kr-storage' } # First declarator (after specs). Backstep the head token so # init_declarator's open sees it; descend into the sub-rule. # ID head: plain declarator. STAR head: pointer prefix. @@ -238,9 +309,8 @@ # owning declaration_specifiers list. 
spec_loop: { open: [ - # Attribute specs interleave freely with simple specifiers and - # tagged-type heads. C23 [[…]] is plain; GCC __attribute__ / - # __attribute / MSVC __declspec are extensions. + # C23 attribute spec [[…]] interleaves with simple specifiers + # and tagged-type heads. { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@as23-adjacent-open' b: 2 p: 'attribute_spec_c23' g: 'spec-loop-attr-c23' } # [extension: gcc-attr] @@ -271,7 +341,6 @@ { s: [] g: 'spec-loop-empty' } ] close: [ - # See open above for the plain-vs-extension split. { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@as23-adjacent-open' b: 2 p: 'attribute_spec_c23' g: 'spec-loop-more-attr-c23' } # [extension: gcc-attr] @@ -282,8 +351,13 @@ # [extension: msvc-attr] { s: 'KW___DECLSPEC' c: '@extended-on' b: 1 p: 'attribute_spec_msvc' g: 'spec-loop-more-attr-msvc' } - # Tagged-type heads must come before #SIMPLE_TYPE_HEAD here - # too (see open above for rationale). + # Storage prefix: when a leading attribute spec routed into + # spec_loop ahead of any storage keyword (e.g. + # `__attribute__((unused)) static int q;`), the storage class + # has to be absorbed here rather than at the simple_declaration + # level. + { s: '#STORAGE_PREFIX' a: '@absorb-spec-storage' + r: 'spec_loop' g: 'spec-loop-more-storage' } { s: 'KW_STRUCT' b: 1 p: 'struct_specifier' g: 'spec-loop-more-struct' } { s: 'KW_UNION' b: 1 p: 'struct_specifier' g: 'spec-loop-more-union' } { s: 'KW_ENUM' b: 1 p: 'enum_specifier' g: 'spec-loop-more-enum' } @@ -346,6 +420,20 @@ # Returning from pointer_list, capture the ID, then re-enter # to check for postfix / initializer. { s: 'ID' a: '@idecl-name' r: 'init_declarator' g: 'idecl-id-after-ptrs' } + # Leading-pointer-type with paren-form declarator + # (`char *(*foo)(int);`): after pointer_list returned, we see + # `(` instead of an ID. Treat it as a paren-form sub-declarator + # and descend into paren_inner_declarator. 
Gated on !named so + # this doesn't fire for the function-postfix `(` after an ID. + { s: 'PUNC_LPAREN' c: '@idecl-not-named' a: '@idecl-paren-open' + p: 'paren_inner_declarator' g: 'idecl-paren-after-ptrs' } + # Post-declarator C23 [[…]] attribute. MUST come before the + # single-token `[` array_postfix alt below so the 2-token + # lookahead wins on `[[` (the @idecl-named-and-as23 condition + # also enforces token adjacency). + { s: 'PUNC_LBRACKET PUNC_LBRACKET' c: '@idecl-named-and-as23' + b: 2 p: 'attribute_spec_c23' r: 'init_declarator' + g: 'idecl-post-attr-c23' } # Array postfix `[ … ]` (one or more dimensions). Each one # re-enters init_declarator so additional postfixes can stack. { s: 'PUNC_LBRACKET' b: 1 p: 'array_postfix' @@ -356,6 +444,23 @@ # only exercises ` ID ( … ) ;`. { s: 'PUNC_LPAREN' b: 1 p: 'function_postfix' r: 'init_declarator' g: 'idecl-fn' } + # Post-declarator GCC __attribute__ / MSVC __declspec. Fire + # after the declarator has been named, so they don't compete + # with leading-attribute alts at the simple_declaration level. + # Each alt re-enters init_declarator so multiple post- + # declarator attributes can chain, and so the trailing `=` + # initializer or `,` / `;` terminator can still be picked up. + # `int x __attribute__((aligned(8)));` + # `void f(void) __attribute__((noreturn));` + { s: 'KW___ATTRIBUTE__' c: '@idecl-named-and-extended' b: 1 + p: 'attribute_spec_gcc' r: 'init_declarator' + g: 'idecl-post-attr-gcc' } + { s: 'KW___ATTRIBUTE' c: '@idecl-named-and-extended' b: 1 + p: 'attribute_spec_gcc' r: 'init_declarator' + g: 'idecl-post-attr-gcc-1' } + { s: 'KW___DECLSPEC' c: '@idecl-named-and-extended' b: 1 + p: 'attribute_spec_msvc' r: 'init_declarator' + g: 'idecl-post-attr-msvc' } { s: 'PUNC_ASSIGN' p: 'initializer' a: '@idecl-take-eq' g: 'idecl-eq' } { s: [] g: 'idecl-end' } ] @@ -396,8 +501,32 @@ # After pointer_list returns, capture the ID then re-enter. 
{ s: 'ID' a: '@pid-name' r: 'paren_inner_declarator' g: 'pid-id-after-ptrs' } + # Nested paren-form: `int (*(*fpp))(int)` — after the + # pointer_list returned, another `(` opens a deeper inner + # declarator. Push `(` onto our direct_declarator and + # recurse; the matching `)` is consumed by the rparen alt + # below (gated on @pid-paren-pending). + { s: 'PUNC_LPAREN' c: '@pid-not-named' a: '@pid-paren-open' + p: 'paren_inner_declarator' g: 'pid-nested' } + # Returning from a nested paren_inner_declarator: consume + # the matching `)`, mark named/parenClosed, and recurse so + # any inner postfix (`[…]` / `(…)`) can still attach. + { s: 'PUNC_RPAREN' c: '@pid-paren-pending' a: '@pid-paren-close' + r: 'paren_inner_declarator' g: 'pid-nested-rparen' } + # Inner array postfix on the just-named declarator: + # `int (*arr[3])(int);` — the `[3]` belongs to the inner + # direct_declarator, not the outer one. array_postfix + # attaches to rule.parent.k.directDeclarator which is our + # inner direct_declarator at this point. + { s: 'PUNC_LBRACKET' c: '@pid-named' b: 1 p: 'array_postfix' + r: 'paren_inner_declarator' g: 'pid-arr' } + # Inner function postfix on the just-named declarator: + # `int (*get())[10];` — the inner `()` is a function postfix. + { s: 'PUNC_LPAREN' c: '@pid-named' b: 1 p: 'function_postfix' + r: 'paren_inner_declarator' g: 'pid-fn' } # Stop before `)` so the outer init_declarator's close can - # consume it. + # consume it (only when this PID isn't itself paren-pending, + # handled by the priority of the @pid-paren-pending alt above). { s: 'PUNC_RPAREN' b: 1 g: 'pid-end-rparen' } { s: [] g: 'pid-end' } ] @@ -500,6 +629,31 @@ ] } + # kr_declaration_list: the K&R parameter-declaration list that + # appears between the closing `)` of an identifier-list parameter + # list and the opening `{` of the function body. 
The legacy + # post-process produced a flat node containing the raw token refs + # (no inner declaration structuring); we mirror that shape so + # consumers see the same CST regardless of which path fired. + # Open absorbs the first token; close keeps absorbing until `{`. + kr_declaration_list: { + open: [ + # Defensive: empty list (caller already guarded). + { s: 'PUNC_LBRACE' b: 1 g: 'kr-empty' } + { s: '#ZZ' b: 1 g: 'kr-eof' } + { s: '#ANY_C_TOKEN' a: '@kr-take' g: 'kr-first' } + ] + close: [ + # Stop at the body-opening `{` (don't consume — let + # simple_declaration's PUNC_LBRACE alt drive + # compound_statement). + { s: 'PUNC_LBRACE' b: 1 g: 'kr-end' } + { s: '#ZZ' b: 1 g: 'kr-end-eof' } + { s: '#ANY_C_TOKEN' a: '@kr-take' r: 'kr_declaration_list' + g: 'kr-more' } + ] + } + # parameter_type_list: 1+ comma-separated parameter_declarations, # optionally terminated by `, ...` for variadic functions. parameter_type_list: { @@ -692,8 +846,7 @@ g: 'stmt-asm-1' } { s: 'KW___ASM__' c: '@extended-on' b: 1 p: 'asm_statement' g: 'stmt-asm-2' } - # [extension: preprocessor] preprocessor line inside a body - # (rare but legal). + # [extension: preprocessor] `#`-line inside a function body. { s: 'PP_HASH' c: '@extended-on' b: 1 p: 'preprocessor_line' g: 'stmt-pp' } # Expression statement (default fallthrough) @@ -898,158 +1051,6 @@ ] } - # ---- asm_statement (phase B4.2.4, opaque-token form) ------------ - # - # GCC inline asm: `__asm__ volatile? goto? ( template : … ) ;`. - # Phase B4.2.4 captures the whole statement as a flat token-list - # under an asm_statement node — qualifiers / template / operand - # sections are NOT yet broken out (that's a follow-up). The shape - # is enough to unblock the body-supportedness gate. - # asm_statement (phase C.8 — structured form): - # * (