Commit 756239d

Add --null-threshold to the docs
Closes: #20
1 parent 6a770e3 commit 756239d

2 files changed

Lines changed: 26 additions & 13 deletions

docs/manual.rst

Lines changed: 6 additions & 1 deletion
@@ -15,7 +15,7 @@ Synopsis
 
 structa [-h] [--version] [-f {auto,csv,json,yaml}] [-e ENCODING]
 [--encoding-strict] [--no-encoding-strict]
-[-F INT] [-M NUM] [-B NUM] [-E NUM] [--str-limit NUM]
+[-F INT] [-M NUM] [-B NUM] [-E NUM] [-N NUM] [--str-limit NUM]
 [--hide-count] [--show-count] [--hide-lengths] [--show-lengths]
 [--hide-pattern] [--show-pattern]
 [--hide-range] [--show-range {hidden,limits,median,quartiles,graph}]
@@ -91,6 +91,11 @@ Optional Arguments
 the pattern from being reported; the proportion of "empty" data permitted
 in a field (default: 99%)
 
+.. option:: -N NUM, --null-threshold NUM
+
+The proportion of values permitted to be null without preventing type
+analysis (default: 99%)
+
 .. option:: --str-limit NUM
 
 The length beyond which only the lengths of strs will be reported; below

docs/tutorial_basic.rst

Lines changed: 20 additions & 12 deletions
@@ -154,14 +154,15 @@ bad threshold mechanism only applies to bad data *within* a homogenous type
 (typically bad string representations of numeric or boolean types).
 
 
-Missing Data (``--empty-threshold``)
-====================================
+Missing Data (``--empty-threshold`` and ``--null-threshold``)
+=============================================================
 
-Another type of "bad" data commonly encountered is empty strings which are
-typically used to represent *missing* data, and (predictably) structa has
-another knob that can be twiddled for this: :option:`structa
---empty-threshold`. The following script generates a list of strings of
-integers in which most of the strings (~70%) are blank:
+Another type of "bad" data commonly encountered is empty strings and nulls
+which are typically used to represent *missing* data, and (predictably) structa
+has more knobs that can be twiddled for this: :option:`structa
+--empty-threshold` and :option:`structa --null-threshold`. The following script
+generates a list of strings of integers in which most of the strings (~70%) are
+blank:
 
 .. literalinclude:: examples/mostly-blank.py
 :caption: mostly-blank.py
@@ -174,11 +175,13 @@ normal:
 $ python3 mostly-blank.py | structa
 [ str of int range=0..100 pattern="d" ]
 
-This is because the default for :option:`structa --empty-threshold` is 99% or
-0.99. If the proportion of blank strings in a field exceeds the empty
-threshold, the field will simply be marked as a string without any further
-processing. Hence, when we re-run this script with the setting turned down to
-50%, the output changes:
+This is because the default for both :option:`structa --empty-threshold` and
+:option:`structa --null-threshold` is 99% or 0.99.
+
+If the proportion of blank strings in a field exceeds the empty threshold, the
+field will simply be marked as a string without any further processing. Hence,
+when we re-run this script with the setting turned down to 50%, the output
+changes:
 
 .. code-block:: console
 
@@ -191,6 +194,11 @@ processing. Hence, when we re-run this script with the setting turned down to
 "100" value, but because it's now considered a string (not a string of
 integers), "100" sorts before "99" alphabetically.
 
+Likewise, if the proportion of null values in a field exceeds the null
+threshold, the field will simply be marked as "value" (an arbitrary mix of
+types), because structa assumes there aren't enough values to accurately
+represent the type of the field.
+
 It is also worth nothing that, by default, structa strips whitespace from
 strings prior to analysis. This is probably not necessary for the vast majority
 of modern datasets, but it's a reasonably safe default, and can be controlled
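
The threshold behaviour described in the tutorial text above can be sketched roughly as follows. This is an illustration only, not structa's actual implementation: the function names ``exceeds_threshold`` and ``classify`` are hypothetical, the ``'analyze'`` return value simply stands in for "carry on with normal type analysis", and the choice to measure the blank proportion over the non-null values is an assumption.

```python
def exceeds_threshold(values, is_missing, threshold=0.99):
    """Return True if the proportion of missing values is above threshold."""
    values = list(values)
    if not values:
        return False
    missing = sum(1 for v in values if is_missing(v))
    return missing / len(values) > threshold


def classify(values, empty_threshold=0.99, null_threshold=0.99):
    """Hypothetical stand-in for the decision the tutorial describes."""
    values = list(values)
    # Too many nulls: not enough real values to infer a type.
    if exceeds_threshold(values, lambda v: v is None, null_threshold):
        return 'value'
    # Assumption: blanks are counted among the non-null values only.
    non_null = [v for v in values if v is not None]
    is_blank = lambda v: isinstance(v, str) and v.strip() == ''
    if exceeds_threshold(non_null, is_blank, empty_threshold):
        return 'str'
    # Otherwise, further type analysis would proceed as normal.
    return 'analyze'


# Roughly 70% blank strings, as in the mostly-blank example.
mostly_blank = [''] * 70 + [str(i) for i in range(30)]
# 80% nulls, to exercise the null threshold.
mostly_null = [None] * 8 + [1, 2]
```

With the default thresholds of 0.99, both sample lists fall through to normal analysis; lowering the relevant threshold to 0.5 changes the result to ``'str'`` for the mostly blank list and ``'value'`` for the mostly null list, mirroring the behaviour the documentation describes.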
