| 1 | +--- |
| 2 | +title: "Box plots" |
| 3 | +description: "Comparing the distribution of a single numeric variable across groups"
| 4 | +image: thumbnails/boxplot.svg |
| 5 | +categories: [basic, boxplot, distribution] |
| 6 | +order: 3 |
| 7 | +--- |
| 8 | + |
| 9 | +Boxplots are a popular way to summarise the distribution of a single continuous variable.
| 10 | +Keep in mind that a boxplot hides the actual shape of the distribution behind a summary, which can mislead when, for example, the data are bi- or multi-modal.
| 11 | +For every group, a boxplot displays the following six things:
| 12 | + |
| 13 | +1. The 25^th^ percentile, or Q1, as the start of the box. |
| 14 | +2. The 50^th^ percentile, i.e. median or Q2, as a line across the box. |
| 15 | +3. The 75^th^ percentile, or Q3, as the end of the box. Together with Q1 we can compute the interquartile range: IQR = Q3 - Q1. |
| 16 | +4. The minimum data value or Q1 - 1.5 * IQR, whichever is larger. This is displayed as the lower whisker. |
| 17 | +5. The maximum data value or Q3 + 1.5 * IQR, whichever is smaller. This is displayed as the upper whisker. |
| 18 | +6. Outliers outside the whiskers, if present. These are drawn as individual points. |
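
The six values listed above can be computed by hand. The following is a minimal Python sketch that follows the description as written; it is not ggsql's implementation. The function name `boxplot_summary` is hypothetical, and the choice of Python's `statistics.quantiles` with the `inclusive` method is an assumption, since different tools interpolate quartiles differently. Note also that many plotting libraries end each whisker at the most extreme data point that lies within the 1.5 * IQR fence, rather than at the fence itself.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Return the boxplot summary of `values` as described in the list above.

    Hypothetical helper for illustration only; ggsql may compute
    quartiles and whiskers differently.
    """
    data = sorted(values)
    # Q1, median (Q2) and Q3; the interpolation method is an assumption.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Minimum data value or the lower fence, whichever is larger.
        "lower_whisker": max(data[0], lower_fence),
        # Maximum data value or the upper fence, whichever is smaller.
        "upper_whisker": min(data[-1], upper_fence),
        # Points beyond the fences are drawn individually.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
print(summary)
```

For this sample, the quartiles are 3.25, 5.5 and 7.75, so IQR = 4.5. The lower whisker stays at the minimum (1), the upper whisker is capped at 7.75 + 1.5 * 4.5 = 14.5, and 40 is reported as an outlier.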
| 19 | + |
| 20 | +## Code |
| 21 | + |
| 22 | +```{ggsql} |
| 23 | +VISUALISE species AS x, bill_len AS y FROM ggsql:penguins |
| 24 | + DRAW boxplot |
| 25 | +``` |
| 26 | + |
| 27 | +## Explanation |
| 28 | + |
| 29 | +* `VISUALISE ... FROM ggsql:penguins` loads the built-in penguins dataset.
| 30 | +* `species AS x` maps a categorical variable to `x`, which separates the data into groups.
| 31 | +* `bill_len AS y` sets the numeric variable to summarise.
| 32 | +* `DRAW boxplot` adds the boxplot layer.
| 33 | + |
| 34 | +## Variations |
| 35 | + |
| 36 | +### Dodging |
| 37 | + |
| 38 | +You can split the groups further by mapping a second categorical variable, such as `island` to `fill`. The boxplots within each axis category are then dodged, that is, drawn side by side.
| 39 | + |
| 40 | +```{ggsql} |
| 41 | +VISUALISE species AS x, bill_len AS y, island AS fill FROM ggsql:penguins |
| 42 | + DRAW boxplot |
| 43 | +``` |
| 44 | + |
| 45 | +However, dodging can be unproductive or counterintuitive in some cases.
| 46 | +For example, if we double-encode a group by mapping `species` to both `x` *and* `fill`, as in the plot below, dodging makes the plot look worse rather than better.
| 47 | + |
| 48 | +```{ggsql} |
| 49 | +VISUALISE species AS x, bill_len AS y, species AS fill FROM ggsql:penguins |
| 50 | + DRAW boxplot |
| 51 | +``` |
| 52 | + |
| 53 | +We can disable the dodging by setting `position => 'identity'`. |
| 54 | + |
| 55 | +```{ggsql} |
| 56 | +VISUALISE species AS x, bill_len AS y, species AS fill FROM ggsql:penguins |
| 57 | + DRAW boxplot SETTING position => 'identity' |
| 58 | +``` |
| 59 | + |
| 60 | +### Horizontal |
| 61 | + |
| 62 | +To draw the boxplots horizontally, simply swap the `x` and `y` mappings.
| 63 | +The orientation is detected automatically based on which variable is continuous and which is discrete. |
| 64 | + |
| 65 | +```{ggsql} |
| 66 | +VISUALISE bill_len AS x, species AS y, island AS fill FROM ggsql:penguins |
| 67 | + DRAW boxplot |
| 68 | +``` |
| 69 | + |
| 70 | +### With individual datapoints |
| 71 | + |
| 72 | +Because a boxplot is a summary, it may be a good idea to supplement it with the individual datapoints so that you can't be accused of 'hiding' the distribution.
| 73 | +The datapoints can be jittered by setting `position => 'jitter'`.
| 74 | +When you do this, make sure to set `outliers => false` so that the outlier points are not drawn twice, once in each layer.
| 75 | + |
| 76 | +<!-- TODO: Figure out why the boxplot width is so small --> |
| 77 | + |
| 78 | +```{ggsql} |
| 79 | +VISUALISE species AS x, bill_len AS y FROM ggsql:penguins |
| 80 | + DRAW point SETTING position => 'jitter' |
| 81 | + DRAW boxplot SETTING outliers => false |
| 82 | +``` |
| 83 | + |
| 84 | + |