@@ -154,14 +154,15 @@ bad threshold mechanism only applies to bad data *within* a homogenous type
154154(typically bad string representations of numeric or boolean types).
155155
156156
157- Missing Data (``--empty-threshold ``)
158- ====================================
157+ Missing Data (``--empty-threshold `` and `` --null-threshold `` )
158+ =============================================================
159159
160- Another type of "bad" data commonly encountered is empty strings which are
161- typically used to represent *missing * data, and (predictably) structa has
162- another knob that can be twiddled for this: :option: `structa
163- --empty-threshold `. The following script generates a list of strings of
164- integers in which most of the strings (~70%) are blank:
160+ Another type of "bad" data commonly encountered is empty strings and nulls
161+ which are typically used to represent *missing * data, and (predictably) structa
162+ has more knobs that can be twiddled for this: :option: `structa
163+ --empty-threshold ` and :option: `structa --null-threshold `. The following script
164+ generates a list of strings of integers in which most of the strings (~70%) are
165+ blank:
165166
166167.. literalinclude :: examples/mostly-blank.py
167168 :caption: mostly-blank.py
@@ -174,11 +175,13 @@ normal:
174175 $ python3 mostly-blank.py | structa
175176 [ str of int range=0..100 pattern="d" ]
176177
177- This is because the default for :option: `structa --empty-threshold ` is 99% or
178- 0.99. If the proportion of blank strings in a field exceeds the empty
179- threshold, the field will simply be marked as a string without any further
180- processing. Hence, when we re-run this script with the setting turned down to
181- 50%, the output changes:
178+ This is because the default for both :option: `structa --empty-threshold ` and
179+ :option: `structa --null-threshold ` is 99% or 0.99.
180+
181+ If the proportion of blank strings in a field exceeds the empty threshold, the
182+ field will simply be marked as a string without any further processing. Hence,
183+ when we re-run this script with the setting turned down to 50%, the output
184+ changes:
182185
183186.. code-block :: console
184187
@@ -191,6 +194,11 @@ processing. Hence, when we re-run this script with the setting turned down to
191194 "100" value, but because it's now considered a string (not a string of
192195 integers), "100" sorts before "99" alphabetically.
193196
197+ Likewise, if the proportion of null values in a field exceeds the null
198+ threshold, the field will simply be marked as "value" (an arbitrary mix of
199+ types), because structa assumes there aren't enough values to accurately
200+ represent the type of the field.
201+
194202It is also worth nothing that, by default, structa strips whitespace from
195203strings prior to analysis. This is probably not necessary for the vast majority
196204of modern datasets, but it's a reasonably safe default, and can be controlled
0 commit comments