Commit b7613f6

Add white-space productions to the Selectors grammar implementation
The missing white-space broke parsing of selectors, with the latter not having any tests in place to help uncover the issue. This adds handling of white-space through explicit references in the grammar (parsing procedures don't have to be amended), to match the specified behaviour (including that defined with prose).
2 parents 0a41f2b + 5464575 commit b7613f6
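
The commit message describes white-space being handled through explicit references in the grammar, and the diff imports an `intersperse` helper from `.utils` whose definition is not part of this commit. A minimal sketch of what such a helper might look like, assuming only the call shape visible in the diff (positional items plus a keyword-only `separator`), is given here; the body is an assumption and not the project's actual implementation.

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

def intersperse(*items: T, separator: T) -> Iterator[T]:
    """Yield `items` with `separator` placed between every two adjacent items.

    Hypothetical stand-in for `csspring.utils.intersperse`, which this commit
    imports but does not show.
    """
    for index, item in enumerate(items):
        if index:  # no separator before the first item
            yield separator
        yield item

# Strings stand in for production objects here
print(list(intersperse("[", "wq-name", "]", separator="OWS")))
# → ['[', 'OWS', 'wq-name', 'OWS', ']']
```

Unpacked into `ConcatenationProduction(*intersperse(..., separator=OWS))`, a helper of this shape would place an optional white-space production between every two components without repeating `OWS` by hand.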

5 files changed

Lines changed: 40 additions & 52 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ Parsing is offered only in the form of Python modules — no "command-line" prog
 
 ### Why?
 
-We wanted a "transparent" CSS parser — one that one could be used in different configurations without it imposing limitations that would strictly speaking go beyond parsing. Put differently, we wanted a parser that does not assume any particular application, a software _library_ in the classical sense of the term, or a true _API_ if you will.
+We wanted a "transparent" CSS parser — one that could be used in different configurations without it imposing limitations that would strictly speaking go beyond parsing. Put differently, we wanted a parser that does not assume any particular application, a software _library_ in the classical sense of the term, or a true _API_ if you will.
 
 For instance, the popular [Less](http://lesscss.org) software seems to rather effortlessly parse CSS [3] text, but it invariably re-arranges white-space in the output, without giving the user any control over the latter. Less is not _transparent_ like that — there is no way to use it with recovery of the originally parsed text from the parse tree — parsing with Less is a one-way street for at least _some_ applications (specifically those that "transform" CSS but need to preserve all of the original input as-is).
 

expand-macros.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 Macro processing refers here to eager rewriting/replacement/substitution of Python code constructs decorated with the "syntactic" (no definition available normally, when the containing module is imported) decorator `macro`. The purpose of such processing is to implement the equivalent to what is usually called "pre-processing" for e.g. C/C++ language(s). As `macro`-decorated procedures (only decorating of procedures is currently effectively supported for `macro`) are encountered during processing of Python code, the entire procedure is removed and "unparsed" equivalent of the series of AST statements it returned, are inserted in its place instead.
 
-This implements powerful and "semantically-aware" code pre-processing mechanism, for situations demanding it. Our immediate need with this was to allow type checkers like MyPy to be able to analyze as much of the project's Python code as possible, which these are normally unable to do in cases of so-called dynamically created types (and consequently object(s) of such types). And so instead of living with effectively uncheckable dynamic types created with the `type` built-in -- for e.g. `Token` subclasses -- we employ _pre-processing_ of Python code into Python code which lends to type-checking, a benefit we deemed to ba a "must-have" for the project.
+This implements powerful and "semantically-aware" code pre-processing mechanism, for situations demanding it. Our immediate need with this was to allow type checkers like MyPy to be able to analyze as much of the project's Python code as possible, which these are normally unable to do in cases of so-called dynamically created types (and consequently object(s) of such types). And so instead of living with effectively uncheckable dynamic types created with the `type` built-in -- for e.g. `Token` subclasses -- we employ _pre-processing_ of Python code into Python code which lends to type-checking, a benefit we deemed to be a "must-have" for the project.
 """
 
 import ast

src/csspring/selectors.py

Lines changed: 25 additions & 38 deletions
@@ -11,7 +11,8 @@
 from .syntax.tokenizing import Token, BadStringToken, BadURLToken, CloseBraceToken, CloseBracketToken, CloseParenToken, ColonToken, DelimToken, FunctionToken, HashToken, IdentToken, OpenBraceToken, OpenBracketToken, OpenParenToken, StringToken
 
 from .syntax.grammar import any_value
-from .values import Production, AlternativesProduction, CommaSeparatedRepetitionProduction, ConcatenationProduction, NonEmptyProduction, OptionalProduction, ReferenceProduction, RepetitionProduction, TokenProduction
+from .values import Production, AlternativesProduction, CommaSeparatedRepetitionProduction, ConcatenationProduction, NonEmptyProduction, OptionalProduction, ReferenceProduction, RepetitionProduction, TokenProduction, OWS
+from .utils import intersperse
 
 from functools import singledispatch
 from typing import cast
@@ -67,29 +68,6 @@ def parse_any_value(input: TokenStream) -> Product | None:
     else:
         return None
 
-@parse.register
-def _(production: CommaSeparatedRepetitionProduction, input: TokenStream) -> Product | None:
-    """Variant of `parse` for productions of the `#` multiplier variety (see https://drafts.csswg.org/css-values-4/#mult-comma)."""
-    result: list[Product | Token] = []
-    input.mark()
-    while True:
-        value: Product | Token | None
-        if result:
-            value = parse(production.delimiter, input)
-            if value is None:
-                break
-            result.append(value)
-        value = parse(production.element, input)
-        if value is None:
-            break
-        result.append(value)
-    if result:
-        input.discard_mark()
-        return result
-    else:
-        input.restore_mark()
-        return None
-
 @parse.register
 def _(production: ConcatenationProduction, input: TokenStream) -> Product | None:
     """Variant of `parse` for productions of the ` ` combinator variety (see "juxtaposing components" at https://drafts.csswg.org/css-values-4/#component-combinators)."""
@@ -106,7 +84,7 @@ def _(production: ConcatenationProduction, input: TokenStream) -> Product | None
 @parse.register
 def _(production: NonEmptyProduction, input: TokenStream) -> Product | None:
     """Variant of `parse` for productions of the `!` multiplier variety (see https://drafts.csswg.org/css-values-4/#mult-req)."""
-    result = cast(Product, parse(production.element, input)) # The element of a non-empty production is concatenation, and the `parse` overload for `ConcatenationProduction` never returns a `Token`, only `Product | None`
+    result = cast(Product | None, parse(production.element, input)) # The element of a non-empty production is concatenation, and the `parse` overload for `ConcatenationProduction` never returns a `Token`, only `Product | None`
     if result and any(tokens(result)):
         return result
     else:
@@ -126,9 +104,21 @@ def _(production: RepetitionProduction, input: TokenStream) -> Product | None:
     result: list[Product | Token] = []
     input.mark()
     while True:
+        if result and production.separator:
+            input.mark()
+            separator = parse(production.separator, input)
+            if separator is None:
+                input.restore_mark()
+                break
         value = parse(production.element, input)
         if value is None:
+            if result and production.separator:
+                input.restore_mark()
             break
+        if result and production.separator:
+            assert separator is not None
+            result.append(separator)
+            input.discard_mark()
         result.append(value)
         if len(result) == production.max:
             break
@@ -157,13 +147,19 @@ def parse_selector_list(input: TokenStream) -> Product | None:
 
     Parsing of selector lists is the _reason d'etre_ for this module and this is the [convenience] procedure that exposes the feature.
     """
-    return cast(Product | None, parse(grammar.selector_list, input))
+    return cast(Product | None, parse(ConcatenationProduction(OWS, grammar.selector_list, OWS), input))
 
 class Grammar:
     """The grammar defining the language of selector list expressions.
 
     Normally a grammar would be defined as a set of rules (for deriving productions), where each rule would feature a component to the left side of the `->` operator (the "rewriting" operator) and a component to the right side of the operator. Owing to relative simplicity of the Selectors grammar -- where the left-hand side component is always a production name _reference_ (an identifying factor of context free grammars), we leverage Python's meta-programming facilities and use class attribute assignment statements to define the rules instead, where the assigned value is the right side of the rule, an arbitrary production (which may be an opaque value). Each attribute of the grammar is assigned the corresponding name automatically, owing to the `__set_name__` dunder method of the common production (super)class (where appropriate).
 
+    NOTE: Some of the productions as defined in the specification, have been rewritten below to eliminate repetition. These rewritten productions are marked accordingly, for clarity.
+
+    NOTE: `intersperse` is used to insert white-space productions as required by the specification, which otherwise doesn't include them explicitly, instead describing white-space handling "in prose".
+
+    NOTE: There is no notation (defined by the Values & Units spec.) for expressing `RepetitionProduction` productions with a `separator` attribute value other than `None` (the '[ ... ]*' variant) or that of `CommaSeparatedRepetitionProduction` (the '[ ... ]#' variant). Nevertheless, these productions are employed below to eliminate repetition as part of optimizing the grammar.
+
     Implements http://drafts.csswg.org/selectors-4/#grammar.
     """
     ns_prefix = ConcatenationProduction(OptionalProduction(AlternativesProduction(TokenProduction(IdentToken), TokenProduction(DelimToken, value='*'))), TokenProduction(DelimToken, value='|'))
@@ -173,26 +169,17 @@ class Grammar:
     class_selector = ConcatenationProduction(TokenProduction(DelimToken, value='.'), TokenProduction(IdentToken))
     attr_matcher = ConcatenationProduction(OptionalProduction(AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('~', '|', '^', '$', '*')))), TokenProduction(DelimToken, value='='))
     attr_modifier = AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('i', 's')))
-    attribute_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), TokenProduction(CloseBracketToken)), ConcatenationProduction(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), ReferenceProduction(attr_matcher), AlternativesProduction(TokenProduction(StringToken), TokenProduction(IdentToken)), OptionalProduction(ReferenceProduction(attr_modifier)), TokenProduction(CloseBracketToken)))
+    attribute_selector = ConcatenationProduction(*intersperse(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), OptionalProduction(ConcatenationProduction(*intersperse(ReferenceProduction(attr_matcher), AlternativesProduction(TokenProduction(StringToken), TokenProduction(IdentToken)), OptionalProduction(ReferenceProduction(attr_modifier)), separator=OWS))), TokenProduction(CloseBracketToken), separator=OWS)) # Rewritten
     legacy_pseudo_element_selector = ConcatenationProduction(TokenProduction(ColonToken), AlternativesProduction(*(TokenProduction(IdentToken, value=value) for value in ('before', 'after', 'first-line', 'first-letter'))))
-    pseudo_class_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(ColonToken), TokenProduction(IdentToken)), ConcatenationProduction(TokenProduction(ColonToken), TokenProduction(FunctionToken), ReferenceProduction(any_value), TokenProduction(CloseParenToken)))
+    pseudo_class_selector = ConcatenationProduction(TokenProduction(ColonToken), AlternativesProduction(TokenProduction(IdentToken), ConcatenationProduction(TokenProduction(FunctionToken), ReferenceProduction(any_value), TokenProduction(CloseParenToken)))) # Rewritten
     pseudo_element_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(ColonToken), ReferenceProduction(pseudo_class_selector)), ReferenceProduction(legacy_pseudo_element_selector))
     pseudo_compound_selector = ConcatenationProduction(ReferenceProduction(pseudo_element_selector), RepetitionProduction(ReferenceProduction(pseudo_class_selector)))
     subclass_selector = AlternativesProduction(ReferenceProduction(id_selector), ReferenceProduction(class_selector), ReferenceProduction(attribute_selector), ReferenceProduction(pseudo_class_selector))
     compound_selector = NonEmptyProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(type_selector)), RepetitionProduction(ReferenceProduction(subclass_selector))))
     complex_selector_unit = NonEmptyProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(compound_selector)), RepetitionProduction(ReferenceProduction(pseudo_compound_selector))))
     combinator = AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('>', '+', '~')), ConcatenationProduction(*(TokenProduction(DelimToken, value=value) for value in ('|', '|'))))
-    complex_selector = ConcatenationProduction(ReferenceProduction(complex_selector_unit), RepetitionProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_selector_unit))))
+    complex_selector = RepetitionProduction(ReferenceProduction(complex_selector_unit), min=1, separator=AlternativesProduction(ConcatenationProduction(OWS, ReferenceProduction(combinator), OWS), OWS)) # Rewritten
     complex_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(complex_selector))
     selector_list = ReferenceProduction(complex_selector_list)
-    complex_real_selector = ConcatenationProduction(ReferenceProduction(compound_selector), RepetitionProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(compound_selector))))
-    complex_real_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(complex_real_selector))
-    compound_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(compound_selector))
-    simple_selector = AlternativesProduction(ReferenceProduction(type_selector), ReferenceProduction(subclass_selector))
-    simple_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(simple_selector))
-    relative_selector = ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_selector))
-    relative_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(relative_selector))
-    relative_real_selector = ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_real_selector))
-    relative_real_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(relative_real_selector))
 
 grammar = Grammar()
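
The separator handling added to the `RepetitionProduction` variant of `parse` follows a backtracking pattern: a separator is consumed tentatively, and the stream is rewound if no element follows it, so that a trailing separator is never consumed. The sketch below reproduces that control flow on a toy stream of strings. `Stream`, `accept` and `parse_separated` are hypothetical names introduced for illustration; only the `mark`/`restore_mark`/`discard_mark` method names are taken from the diff, and their stack-like behaviour is an assumption.

```python
class Stream:
    """Toy token stream with a stack of marks (hypothetical stand-in for `TokenStream`)."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.marks = []

    def mark(self):
        self.marks.append(self.pos)  # remember the current position

    def restore_mark(self):
        self.pos = self.marks.pop()  # rewind to the most recent mark

    def discard_mark(self):
        self.marks.pop()  # commit: forget the most recent mark

    def accept(self, expected):
        """Consume and return the next token if it equals `expected`, else return None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == expected:
            self.pos += 1
            return expected
        return None


def parse_separated(stream, element, separator):
    """Parse `element (separator element)*` without consuming a dangling separator."""
    result = []
    while True:
        if result:
            stream.mark()  # the separator is only tentatively consumed
            sep = stream.accept(separator)
            if sep is None:
                stream.restore_mark()
                break
        value = stream.accept(element)
        if value is None:
            if result:
                stream.restore_mark()  # un-consume the separator that had no element after it
            break
        if result:
            result.append(sep)
            stream.discard_mark()
        result.append(value)
    return result


stream = Stream(["a", ",", "a", ",", "b"])
print(parse_separated(stream, "a", ","), stream.pos)
# → ['a', ',', 'a'] 3
```

The second comma is left in the stream because no `a` follows it, which mirrors the `restore_mark()` call added under `if value is None:` in the diff.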

src/csspring/syntax/parsing.py

Lines changed: 2 additions & 1 deletion
@@ -411,7 +411,8 @@ def consume_list_of_component_values(input: Input, *, stop_token: type[Token | N
 
 def consume_simple_block(input: Input, *, to: Appender[SimpleBlock]) -> SimpleBlock:
     """Implements http://drafts.csswg.org/css-syntax/#consume-simple-block."""
-    assert isinstance(token := input.next_token(), (OpenBraceToken, OpenBracketToken, OpenParenToken))
+    token = input.next_token()
+    assert isinstance(token, (OpenBraceToken, OpenBracketToken, OpenParenToken))
     ending_token = token.mirror_type
     block = SimpleBlock()
     consume_token(input, to=block)

src/csspring/values.py

Lines changed: 11 additions & 11 deletions
@@ -80,21 +80,25 @@ class RepetitionProduction(Production):
 
     Implements the `*` notation as defined at http://drafts.csswg.org/css-values-4/#mult-zero-plus.
     """
+    separator: Production | None = None
     element: Production
     min: int
     max: int | None
-    def __init__(self, element: Production, min: int = 0, max: int | None = None):
+    def __init__(self, element: Production, min: int = 0, max: int | None = None, *, separator: Production | None = None):
         """
         :param element: The production expressing the repeating part of this production
         :param min: The minimum amount of times the parser must accept input, i.e. the minimum number of repetitions of token sequences accepted by the parser
         :param max: The maximum amount of times the parser will be called, i.e. the maximum number of repetitions that may be consumed in the input; the value of `None` implies no maximum (i.e. no upper bound on repetition)
+        :param separator: A production expressing the "delimiting" part between any two repetitions of the `element` production; if omitted or `None`, there's _no_ delimiting part -- repetitions are _adjacent_
         """
         assert min >= 0
         assert max is None or max > 0
         assert max is None or min <= max
         self.min = min
         self.max = max
         self.element = element
+        if separator:
+            self.separator = separator
 
 class OptionalProduction(RepetitionProduction):
     """Class of productions equivalent to `RepetitionProduction` with no lower bound and accepting no repetition of the element, meaning the element is expressed at most once.
@@ -117,22 +121,18 @@ def __init__(self, type: builtins.type[Token], **attributes):
         self.type = type
         self.attributes = attributes
 
+OWS = optional_whitespace = RepetitionProduction(TokenProduction(WhitespaceToken))
 whitespace = RepetitionProduction(TokenProduction(WhitespaceToken), min=1) # The white-space production; presence of white-space expressed with this production, is _mandatory_ (`min=1`); the definition was "hoisted" here because a) it depends on `RepetitionProduction` and `TokenProduction` definitions, which must thus precede it, and b) because the `CommaSeparatedRepetitionParser` definition that follows, depends on it, in turn
 
-class CommaSeparatedRepetitionProduction(Production):
+class CommaSeparatedRepetitionProduction(RepetitionProduction):
     """Class of productions that express a non-empty comma-separated repetition (CSR) of a production element.
 
-    Unlike `RepetitionProduction` which permits arbitrary number of the production element, this class does not currently implement arbitrary repetition bounds. The delimiting part (a comma optionally surrounded by white-space) is mandatory, which implies at least one repetition (two expressions of the element). Disregarding the delimiting behaviour, productions of this class thus behave like those of `RepetitionProduction` with `2` for `min` and `None` for `max` property values.
-
     Implements the `#` notation as defined at http://drafts.csswg.org/css-values-4/#mult-comma.
     """
-    delimiter = ConcatenationProduction(OptionalProduction(AlternativesProduction(whitespace, TokenProduction(CommentToken))), TokenProduction(CommaToken), OptionalProduction(AlternativesProduction(whitespace, TokenProduction(CommentToken)))) # The production expressing the delimiter to use with the repetition, a comma with [optional] white-space around it
-    element: Production
-    def __init__(self, element: Production):
-        """
-        :param element: A production to use for expressing the repeating part in this production
-        """
-        self.element = element
+    separator = ConcatenationProduction(OWS, TokenProduction(CommaToken), OWS) # A comma with [optional] white-space around it
+    def __init__(self, element: Production, min: int = 1, max: int | None = None):
+        assert min >= 1 # "one or more times" (ref. definition); the spec. does not define whether a minimum of zero is permitted, so we err on the safer side
+        super().__init__(element, min, max)
 
 class Formatter:
     """Class of objects that offer procedures for serializing productions into streams of text formatted per the [value definition syntax](https://drafts.csswg.org/css-values-4/#value-defs)."""

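The `values.py` change makes `CommaSeparatedRepetitionProduction` a subclass of `RepetitionProduction` and relies on a class-level `separator` attribute that the base `__init__` only shadows when a separator is actually passed. The following reduced sketch shows why the subclass can keep its comma separator without passing it through `super().__init__`. The class names `Repetition` and `CommaSeparated` and the use of plain strings in place of productions are simplifications introduced here, not the project's code.

```python
class Repetition:
    separator = None  # class-level default: repetitions are adjacent

    def __init__(self, element, min=0, max=None, *, separator=None):
        self.element, self.min, self.max = element, min, max
        if separator:  # shadow the class attribute only when a separator is given
            self.separator = separator


class CommaSeparated(Repetition):
    separator = ","  # class attribute overriding the base default

    def __init__(self, element, min=1, max=None):
        assert min >= 1  # "one or more times"
        super().__init__(element, min, max)  # no separator passed, so the class attribute stays visible


print(Repetition("x").separator)                 # → None
print(Repetition("x", separator=" ").separator)  # prints a single space
print(CommaSeparated("x").separator)             # → ,
```

Because the base constructor never assigns `self.separator = None`, an instance of the subclass has no instance attribute of that name and attribute lookup falls through to the class, so one separator-aware `parse` variant can serve both kinds of repetition.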