Commit 3e50cf3

refactor other functionalities from ast parser into separate classes
1 parent 4e9176e commit 3e50cf3

7 files changed

Lines changed: 443 additions & 465 deletions


AGENTS.md

Lines changed: 9 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -25,12 +25,13 @@ This file contains important information about the sql-metadata repository for A
2525
sql-metadata/
2626
├── sql_metadata/ # Main package
2727
│ ├── parser.py # Public facade — Parser class
28-
│ ├── ast_parser.py # ASTParser — SQL preprocessing, AST construction
28+
│ ├── ast_parser.py # ASTParser — thin orchestrator, composes SqlCleaner + DialectParser
29+
│ ├── sql_cleaner.py # SqlCleaner — raw SQL preprocessing (no sqlglot dependency)
30+
│ ├── dialect_parser.py # DialectParser — dialect detection, parsing, quality validation
2931
│ ├── column_extractor.py # ColumnExtractor — single-pass DFS column/alias extraction
3032
│ ├── table_extractor.py # TableExtractor — table extraction with position sorting
3133
│ ├── nested_resolver.py # NestedResolver — CTE/subquery names, bodies, resolution
3234
│ ├── query_type_extractor.py # QueryTypeExtractor — query type detection
33-
│ ├── dialects.py # Custom sqlglot dialects and detection heuristics
3435
│ ├── comments.py # Comment extraction/stripping (pure functions)
3536
│ ├── keywords_lists.py # QueryType/TokenType enums, keyword sets
3637
│ ├── utils.py # UniqueList, flatten_list, shared helpers
@@ -54,8 +55,9 @@ The v3 architecture uses sqlglot to build an AST, then walks it with specialised
5455
### Pipeline
5556

5657
```
57-
Raw SQL → ASTParser (preprocessing, dialect detection, sqlglot.parse())
58-
→ sqlglot AST
58+
Raw SQL → SqlCleaner (preprocessing)
59+
→ DialectParser (dialect detection, sqlglot.parse())
60+
→ sqlglot AST (cached by ASTParser)
5961
→ TableExtractor (tables, table aliases)
6062
→ ColumnExtractor (columns, column aliases — single-pass DFS)
6163
→ NestedResolver (CTE/subquery names + bodies, column resolution)
@@ -75,7 +77,9 @@ Raw SQL → ASTParser (preprocessing, dialect detection, sqlglot.parse())
7577
| Class | Owns | Does NOT own |
7678
|-------|------|-------------|
7779
| `Parser` | Facade, caching, regex fallbacks, value extraction | No extraction logic |
78-
| `ASTParser` | Preprocessing, AST construction | No metadata extraction |
80+
| `ASTParser` | Orchestration, lazy AST caching | No preprocessing, no parsing |
81+
| `SqlCleaner` | Raw SQL preprocessing (REPLACE rewrite, comment strip, CTE normalisation) | No AST, no sqlglot |
82+
| `DialectParser` | Dialect detection, sqlglot parsing, parse-quality validation | No preprocessing |
7983
| `ColumnExtractor` | Column names, column aliases (during DFS walk) | CTE/subquery name extraction (standalone) |
8084
| `TableExtractor` | Table names, table aliases, position sorting | Nothing else |
8185
| `NestedResolver` | CTE/subquery names, CTE/subquery bodies, column resolution | Column extraction |
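The ownership split in the table above can be illustrated with a minimal sketch. It assumes only what the diff states: `SqlCleaner.clean(sql)` returns a `CleanResult`, `DialectParser().parse(clean_sql)` returns `(ast, dialect)`, and `ASTParser` caches the result lazily. The field names of `CleanResult`, the private `_ensure_parsed` helper, the `calls` counter, and the stub bodies are hypothetical stand-ins, not the repository's code.

```python
from typing import Any, Dict, NamedTuple, Optional, Tuple


class CleanResult(NamedTuple):
    # Field names are assumptions; the diff only says the tuple carries
    # the cleaned SQL, an is_replace flag, and a CTE name map.
    sql: str
    is_replace: bool
    cte_name_map: Dict[str, str]


class SqlCleaner:
    """Stand-in cleaner: pure string work, no AST involved."""

    @staticmethod
    def clean(sql: str) -> CleanResult:
        cleaned = sql.strip().rstrip(";")
        return CleanResult(cleaned, cleaned.upper().startswith("REPLACE"), {})


class DialectParser:
    """Stand-in parser: returns a placeholder AST instead of calling sqlglot."""

    calls = 0  # hypothetical counter, used here only to demonstrate laziness

    def parse(self, clean_sql: str) -> Tuple[Any, Optional[str]]:
        DialectParser.calls += 1
        return ("stub-ast", clean_sql), None


class ASTParser:
    """Thin orchestrator: composes the two collaborators and caches the result."""

    def __init__(self, sql: str) -> None:
        self._sql = sql
        self._parsed = False
        self._ast: Any = None
        self._dialect: Optional[str] = None
        self._clean: Optional[CleanResult] = None

    def _ensure_parsed(self) -> None:
        # Runs the clean -> parse pipeline at most once per instance.
        if self._parsed:
            return
        self._clean = SqlCleaner.clean(self._sql)
        self._ast, self._dialect = DialectParser().parse(self._clean.sql)
        self._parsed = True

    @property
    def ast(self) -> Any:
        self._ensure_parsed()
        return self._ast

    @property
    def dialect(self) -> Optional[str]:
        self._ensure_parsed()
        return self._dialect

    @property
    def is_replace(self) -> bool:
        self._ensure_parsed()
        return self._clean.is_replace

    @property
    def cte_name_map(self) -> Dict[str, str]:
        self._ensure_parsed()
        return self._clean.cte_name_map
```

Constructing `ASTParser(sql)` does no work in this sketch; the first access to any of the four properties triggers the pipeline, and later accesses reuse the cached values.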

ARCHITECTURE.md

Lines changed: 42 additions & 39 deletions
@@ -7,12 +7,13 @@ sql-metadata v3 is a Python library that parses SQL queries and extracts metadat
77
| Module | Role | Key Class/Function |
88
|--------|------|--------------------|
99
| [`parser.py`](sql_metadata/parser.py) | Public facade — composes all extractors via lazy properties | `Parser` |
10-
| [`ast_parser.py`](sql_metadata/ast_parser.py) | SQL preprocessing, dialect detection, AST construction | `ASTParser` |
10+
| [`ast_parser.py`](sql_metadata/ast_parser.py) | Thin orchestrator — composes SqlCleaner + DialectParser, caches AST | `ASTParser` |
11+
| [`sql_cleaner.py`](sql_metadata/sql_cleaner.py) | Raw SQL preprocessing (no sqlglot dependency) | `SqlCleaner`, `CleanResult` |
12+
| [`dialect_parser.py`](sql_metadata/dialect_parser.py) | Dialect detection, sqlglot parsing, parse-quality validation | `DialectParser`, `HashVarDialect`, `BracketedTableDialect` |
1113
| [`column_extractor.py`](sql_metadata/column_extractor.py) | Single-pass DFS column/alias extraction | `ColumnExtractor` |
1214
| [`table_extractor.py`](sql_metadata/table_extractor.py) | Table extraction with position-based sorting | `TableExtractor` |
1315
| [`nested_resolver.py`](sql_metadata/nested_resolver.py) | CTE/subquery name and body extraction, nested column resolution | `NestedResolver` |
1416
| [`query_type_extractor.py`](sql_metadata/query_type_extractor.py) | Query type detection from AST root node | `QueryTypeExtractor` |
15-
| [`dialects.py`](sql_metadata/dialects.py) | Custom sqlglot dialects and dialect detection heuristics | `HashVarDialect`, `BracketedTableDialect`, `detect_dialects` |
1617
| [`comments.py`](sql_metadata/comments.py) | Comment extraction/stripping via tokenizer gaps | `extract_comments`, `strip_comments` |
1718
| [`keywords_lists.py`](sql_metadata/keywords_lists.py) | Keyword sets, `QueryType` and `TokenType` enums ||
1819
| [`utils.py`](sql_metadata/utils.py) | `UniqueList` (deduplicating list), `flatten_list`, `_make_reverse_cte_map` ||
@@ -28,10 +29,9 @@ flowchart TB
2829
2930
subgraph AST_CONSTRUCTION["ASTParser (ast_parser.py)"]
3031
direction TB
31-
PP["Preprocessing"]
32-
DD["Dialect Detection\n(dialects.py)"]
33-
SG["sqlglot.parse()"]
34-
PP --> DD --> SG
32+
PP["SqlCleaner\n(sql_cleaner.py)"]
33+
DP["DialectParser\n(dialect_parser.py)"]
34+
PP --> DP
3535
end
3636
3737
SQL --> AST_CONSTRUCTION
@@ -130,15 +130,21 @@ def tables(self) -> List[str]:
130130

131131
---
132132

133-
### ASTParser — SQL to AST
133+
### ASTParser — Orchestrator
134134

135135
**File:** [`ast_parser.py`](sql_metadata/ast_parser.py) | **Class:** `ASTParser`
136136

137-
Wraps `sqlglot.parse()` with preprocessing, dialect auto-detection, and multi-dialect retry. Instantiated once per `Parser` — actual parsing is deferred until `.ast` is first accessed.
137+
Thin orchestrator that composes `SqlCleaner` and `DialectParser`. Instantiated once per `Parser` — actual parsing is deferred until `.ast` is first accessed. Exposes `.ast`, `.dialect`, `.is_replace`, and `.cte_name_map` properties.
138138

139-
#### Preprocessing pipeline
139+
---
140+
141+
### SqlCleaner — Raw SQL Preprocessing
140142

141-
`_preprocess_sql` applies six steps in order:
143+
**File:** [`sql_cleaner.py`](sql_metadata/sql_cleaner.py) | **Class:** `SqlCleaner`
144+
145+
Pure string transformations with no sqlglot dependency. `SqlCleaner.clean(sql)` returns a `CleanResult` namedtuple with the cleaned SQL, `is_replace` flag, and CTE name map.
146+
147+
#### Preprocessing pipeline
142148

143149
```mermaid
144150
flowchart LR
@@ -158,35 +164,22 @@ flowchart LR
158164
| DB2 isolation clauses | Removes trailing `WITH UR/CS/RS/RR` | `SELECT 1 WITH UR` → `SELECT 1` |
159165
| Outer paren stripping | sqlglot can't parse `((UPDATE ...))` | `((UPDATE t SET x=1))` → `UPDATE t SET x=1` |
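
The last two rows of the table can be sketched as plain string transformations. This is a rough illustration only: the function names and the regular expression are assumptions, the repository's `SqlCleaner` may implement these steps differently, and the paren check below does not account for parentheses inside string literals.

```python
import re

# Trailing DB2 isolation clause: WITH UR / CS / RS / RR at the end of the statement.
_DB2_ISOLATION = re.compile(r"\s+WITH\s+(?:UR|CS|RS|RR)\s*$", re.IGNORECASE)


def strip_db2_isolation(sql: str) -> str:
    """Remove a trailing DB2 isolation clause, if present."""
    return _DB2_ISOLATION.sub("", sql)


def _outer_parens_wrap_everything(s: str) -> bool:
    """True when the first '(' is closed by the very last character."""
    depth = 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i == len(s) - 1
    return False


def strip_outer_parens(sql: str) -> str:
    """Peel redundant parentheses that enclose the whole statement."""
    s = sql.strip()
    while s.startswith("(") and s.endswith(")") and _outer_parens_wrap_everything(s):
        s = s[1:-1].strip()
    return s
```

The balance check matters because a statement such as `(a) + (b)` starts and ends with parentheses without being wrapped by a single outer pair, so it has to be left untouched.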
160166

161-
#### Dialect detection
162-
163-
Dialect detection is handled by `detect_dialects()` in [`dialects.py`](sql_metadata/dialects.py). See the [Dialects](#dialects) section below.
164-
165-
#### Multi-dialect retry
166-
167-
`_try_parse_dialects` iterates through the dialect list. For each dialect:
168-
169-
1. Parse with `sqlglot.parse()` (warnings suppressed)
170-
2. Check for degradation via `_is_degraded_result` — phantom tables (`IGNORE`, `""`), keyword-as-column names (`UNIQUE`, `DISTINCT`)
171-
3. If degraded and not the last dialect, try the next one
172-
4. If all fail, raise `ValueError("This query is wrong")`
173-
174167
---
175168

176-
### Dialects
169+
### DialectParser — Dialect Detection and Parsing
177170

178-
**File:** [`dialects.py`](sql_metadata/dialects.py)
171+
**File:** [`dialect_parser.py`](sql_metadata/dialect_parser.py) | **Class:** `DialectParser`
179172

180-
Contains custom sqlglot dialect classes and the heuristic dialect detection function.
173+
Combines dialect heuristics, `sqlglot.parse()` calls, and parse-quality validation. `DialectParser().parse(clean_sql)` returns `(ast, dialect)`.
181174

182-
**Custom dialects:**
175+
**Custom dialects (defined in same file):**
183176

184177
- `HashVarDialect` — treats `#` as part of identifiers for MSSQL temp tables (`#temp`) and template variables (`#VAR#`)
185178
- `BracketedTableDialect` — TSQL subclass for `[bracket]` quoting; also signals `TableExtractor` to preserve brackets in output
186179

187-
**Detection function:**
180+
#### Dialect detection
188181

189-
`detect_dialects(sql)` inspects the SQL for syntax hints and returns an ordered list of dialects to try:
182+
`_detect_dialects(sql)` inspects the SQL for syntax hints and returns an ordered list of dialects to try:
190183

191184
```mermaid
192185
flowchart TD
@@ -204,6 +197,15 @@ flowchart TD
204197
LV -->|No| DF["[None, mysql]"]
205198
```
206199

200+
#### Multi-dialect retry
201+
202+
`_try_dialects` iterates through the dialect list. For each dialect:
203+
204+
1. Parse with `sqlglot.parse()` (warnings suppressed)
205+
2. Check for degradation via `_is_degraded` — phantom tables (`IGNORE`, `""`), keyword-as-column names (`UNIQUE`, `DISTINCT`)
206+
3. If degraded and not the last dialect, try the next one
207+
4. If all fail, raise `ValueError("This query is wrong")`
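
The four steps above can be sketched as a small loop. To keep the example free of a sqlglot dependency, the parser and the degradation check are passed in as callables; both parameters, the function name, and the choice to treat any parse exception as "try the next dialect" are assumptions for illustration, not the signature of `_try_dialects`.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def try_dialects(
    sql: str,
    dialects: Sequence[Optional[str]],
    parse: Callable[[str, Optional[str]], Any],
    is_degraded: Callable[[Any], bool],
) -> Tuple[Any, Optional[str]]:
    """Return (ast, dialect) for the first dialect with an acceptable parse."""
    last = len(dialects) - 1
    for i, dialect in enumerate(dialects):
        try:
            ast = parse(sql, dialect)  # step 1: parse with this dialect
        except Exception:
            # Assumption: a parse error is handled like a failed attempt.
            continue
        # Steps 2-3: a degraded result is skipped unless no dialect is left.
        if is_degraded(ast) and i != last:
            continue
        return ast, dialect
    # Step 4: every dialect failed.
    raise ValueError("This query is wrong")


# Usage with stand-in callables (hypothetical data shapes):
def fake_parse(sql: str, dialect: Optional[str]) -> dict:
    # Pretend the default dialect misreads IGNORE as a table name.
    return {"tables": ["IGNORE"] if dialect is None else ["t"]}


def has_phantom_table(ast: dict) -> bool:
    return any(name in ("IGNORE", "") for name in ast["tables"])


result = try_dialects("INSERT IGNORE INTO t VALUES (1)", [None, "mysql"],
                      fake_parse, has_phantom_table)
```

Note that in this sketch a degraded result from the last dialect is still returned, which follows from step 3: the retry only happens when another dialect remains.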
208+
207209
---
208210

209211
### ColumnExtractor — columns and aliases
@@ -460,9 +462,9 @@ sequenceDiagram
460462
Note over Parser: Need AST and table_aliases
461463
462464
Parser->>ASTParser: .ast (first access)
463-
ASTParser->>ASTParser: _preprocess_sql()
465+
ASTParser->>ASTParser: SqlCleaner.clean()
464466
Note over ASTParser: No REPLACE, no comments,<br/>no qualified CTEs
465-
ASTParser->>ASTParser: detect_dialects()
467+
ASTParser->>ASTParser: DialectParser().parse()
466468
Note over ASTParser: No special syntax →<br/>[None, "mysql"]
467469
ASTParser->>sqlglot: sqlglot.parse(sql, dialect=None)
468470
sqlglot-->>ASTParser: exp.Select AST
@@ -498,7 +500,7 @@ sequenceDiagram
498500
1. **`Parser.__init__`** — stored raw SQL, created `ASTParser` (lazy)
499501
2. **`.columns_aliases`** accessed → triggers `.columns` (not cached)
500502
3. **`.columns`** needs the AST → accesses `self._ast_parser.ast`
501-
4. **`ASTParser.ast`** (first access) → runs `_preprocess_sql` → `detect_dialects` → `sqlglot.parse()`
503+
4. **`ASTParser.ast`** (first access) → `SqlCleaner.clean()` → `DialectParser().parse()` → `sqlglot.parse()`
502504
5. **`.tables_aliases`** needed for column extraction → `TableExtractor.extract_aliases()` → `{}` (no aliases on `t`)
503505
6. **`ColumnExtractor(ast, {}, {}).extract()`** → DFS walk:
504506
- Visits `Select` node, key `"expressions"` → `_handle_select_exprs()`
@@ -526,12 +528,13 @@ flowchart TB
526528
P --> KW["keywords_lists.py"]
527529
P --> UT["utils.py"]
528530
529-
AST --> COM
530-
AST --> DIA["dialects.py"]
531-
AST -.->|"sqlglot.parse()"| SG["sqlglot"]
531+
AST --> SC["sql_cleaner.py"]
532+
AST --> DP["dialect_parser.py"]
532533
533-
DIA --> COM
534-
TAB --> DIA
534+
SC --> COM
535+
DP --> COM
536+
DP -.->|"sqlglot.parse()"| SG["sqlglot"]
537+
TAB --> DP
535538
536539
EXT -.-> SG
537540
EXT --> UT
@@ -555,11 +558,11 @@ Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser
555558

556559
**Lazy evaluation with caching** — every `Parser` property computes on first access and caches the result. This means you pay zero cost for properties you never access.
557560

558-
**Composition over inheritance** — `Parser` doesn't subclass anything meaningful. It composes `ASTParser`, `TableExtractor`, `ColumnExtractor`, `NestedResolver`, and `QueryTypeExtractor` as separate concerns.
561+
**Composition over inheritance** — `Parser` doesn't subclass anything meaningful. It composes `ASTParser` (which itself composes `SqlCleaner` and `DialectParser`), `TableExtractor`, `ColumnExtractor`, `NestedResolver`, and `QueryTypeExtractor` as separate concerns.
559562

560563
**Single-pass DFS extraction** — `ColumnExtractor` walks the AST exactly once in `arg_types` key order. Because sqlglot's `arg_types` keys are ordered to mirror left-to-right SQL text, the walk naturally processes clauses in source order.
561564

562-
**Multi-dialect retry with degradation detection** — rather than guessing one dialect, `ASTParser` tries several in order and picks the first that doesn't produce a degraded result (phantom tables, keyword-as-column names).
565+
**Multi-dialect retry with degradation detection** — rather than guessing one dialect, `DialectParser` tries several in order and picks the first that doesn't produce a degraded result (phantom tables, keyword-as-column names).
563566

564567
**Graceful regex fallbacks** — when the AST parse fails entirely, the parser degrades to regex-based extraction for columns (INSERT INTO pattern) and LIMIT/OFFSET rather than raising an error.
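
As a rough illustration of the fallback idea for LIMIT/OFFSET, a regex can recover both values when no AST is available. The pattern and the function name are assumptions made for this sketch; the library's actual fallback may use a different pattern and return shape.

```python
import re
from typing import Optional, Tuple

# Matches "LIMIT n", "LIMIT n OFFSET m", and the MySQL form "LIMIT m, n".
_LIMIT_OFFSET = re.compile(
    r"\bLIMIT\s+(?P<first>\d+)"
    r"(?:\s*,\s*(?P<second>\d+)|\s+OFFSET\s+(?P<offset>\d+))?",
    re.IGNORECASE,
)


def limit_offset_fallback(sql: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (limit, offset) found by regex, or None when there is no LIMIT."""
    match = _LIMIT_OFFSET.search(sql)
    if match is None:
        return None
    if match.group("second") is not None:
        # MySQL shorthand: LIMIT <offset>, <row_count>
        return int(match.group("second")), int(match.group("first"))
    offset = match.group("offset")
    return int(match.group("first")), int(offset) if offset is not None else None
```

A fallback like this trades precision for availability: it can be fooled by a `LIMIT` inside a string literal or a subquery, which is why it makes sense only after the AST parse has failed.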
565568
