|
1 | | -# The SHARE Pipeline |
| 1 | +# The SHARE System Spec (Code name: Tamandua) |
| 2 | + |
| 3 | +This set of documents exists to both formalize and restructure how information is processed by SHARE.
| 4 | +Over time the "pipeline" has changed significantly since its inception. The vocabulary used to describe the process has not, which has led to considerable confusion.
2 | 5 |
|
3 | 6 | ## Vocabulary |
4 | 7 |
|
|
7 | 10 | * State -- Source specific version of a given SHARE Object. |
8 | 11 | * Final -- The representation of an object in the publicly accessible SHARE dataset.
9 | 12 | * Harvest -- Collecting data from |
10 | | -* Transform -- |
11 | | -* [Data] Cleaning -- |
12 | 13 | * Normalize -- Defunct; see Transform. |
| 14 | +* Harvester -- Code responsible for acquiring data to be processed by SHARE
| 15 | +* Transformer -- Code responsible for reformatting harvested data into a SHARE compliant format |
| 16 | +* Regulators -- Code responsible for cleaning and validating values emitted by a transformer
| 17 | +* Deduplicator -- The job/code responsible for matching states together to be assembled into the final data set |
| 18 | +* Assembler -- The code responsible for selecting the best attributes from individual states |
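
The last two terms can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `State` fields, matching on a shared `identifier`, and the rule that "best" means the non-empty value from the highest-`priority` state. The spec itself does not define how states are matched or how attributes are ranked.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class State:
    """Source-specific version of a SHARE object (hypothetical shape)."""
    source: str
    identifier: str      # hypothetical matching key, e.g. a DOI
    attributes: dict
    priority: int = 0    # hypothetical: higher means more trusted


def deduplicate(states):
    """Group states that describe the same object (here: same identifier)."""
    groups = defaultdict(list)
    for state in states:
        groups[state.identifier].append(state)
    return list(groups.values())


def assemble(group):
    """Per attribute, keep the non-empty value from the highest-priority state."""
    final = {}
    # Iterate from lowest to highest priority so trusted values overwrite.
    for state in sorted(group, key=lambda s: s.priority):
        for key, value in state.attributes.items():
            if value not in (None, "", []):
                final[key] = value
    return final
```

Under these assumptions, two states with the same identifier collapse into one final object, and an empty value from a trusted source does not erase a populated value from a less trusted one.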
| 19 | + |
| 20 | +### Why did you pick X word? |
| 21 | +* Harvest -- The meaning of this word never actually changed. Used for historical reasons.
| 22 | +* [Transform](http://www.dictionary.com/browse/transform) -- To change in form, appearance, or structure; metamorphose. |
| 23 | +* [Regulate](http://www.dictionary.com/browse/regulate) -- To control or direct by a rule, principle, method, etc. |
| 24 | +* [Deduplicate]() -- |
| 25 | +* [Assemble](http://www.dictionary.com/browse/assemble) -- To put or fit together; put together the parts of. |
13 | 26 |
|
14 | 27 | ## Overview |
15 | 28 |
|
16 | | -### Main pipeline: |
| 29 | +### There is no pipeline |
| 30 | + |
| 31 | +``` |
| 32 | + +--------------+ +--------------------------------------------+ |
| 33 | + | Harvest Task | ---> | Transform ---> Regulate ---> Consolidate | |
| 34 | + +--------------+ +--------------------------------------------+ |
17 | 35 |
|
18 | | -* Data Ingest Task -- Collect and store data |
19 | | -* Data Process Task -- Parse and clean data |
| 36 | + +---------------+ |
| 37 | + | Deduplication | |
| 38 | + +---------------+ |
20 | 39 |
|
21 | | -### Auxiliary Tasks
| 40 | + +----------+ |
| 41 | + | Assemble | |
| 42 | + +----------+ |
| 43 | +``` |
22 | 44 |
|
23 | | -* Data Disambiguate Task -- Link together individual records |
24 | | -* Data Build Task -- Intelligently aggregate records describing the same object
| 45 | +At first glance, the SHARE workflow may appear to be a pipeline. It is not. Data is only addressed as a single entity, briefly, at the beginning of the workflow.
| 46 | +Once data is fully processed, it is treated as part of a whole. All jobs, aside from the first two boxes, operate on the dataset as a whole.
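
The split between per-record work and whole-dataset work can be sketched roughly as follows. This is a toy illustration, not the SHARE implementation: the `dc:title` and `dc:subject` keys, the in-memory `dataset` list standing in for "Consolidate", and the title-counting job are all hypothetical.

```python
def transform(raw):
    """Reformat one harvested datum into a (hypothetical) SHARE-like dict."""
    return {"title": raw.get("dc:title"), "tags": raw.get("dc:subject", [])}


def regulate(record):
    """Clean and validate the values emitted by the transformer."""
    title = (record.get("title") or "").strip()
    if not title:
        raise ValueError("record has no title")
    tags = sorted({t.strip().lower() for t in record.get("tags", []) if t.strip()})
    return {"title": title, "tags": tags}


def process_one(raw, dataset):
    """Per-record part of the workflow: runs once for each harvested datum."""
    dataset.append(regulate(transform(raw)))  # stand-in for "Consolidate"


def count_titles(dataset):
    """Whole-dataset job: sees every stored record, not just one."""
    counts = {}
    for record in dataset:
        key = record["title"].lower()
        counts[key] = counts.get(key, 0) + 1
    return counts
```

`process_one` never needs to know about any other record, while `count_titles` only makes sense when run over everything stored so far. That asymmetry is the reason the workflow is not a simple pipeline.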
25 | 47 |
|
26 | 48 | ### Janitor |
27 | 49 |
|
@@ -54,11 +76,21 @@ The rebuilds will heal any issues in the final dataset |
54 | 76 |
|
55 | 77 |
|
56 | 78 |
|
57 | | - |
58 | 79 | Single global disambiguation task
59 | 80 | Store tags and subjects as arrays |
60 | 81 | SUIDS as another table. Lock to prevent racing on rawdata updating |
61 | 82 |
|
62 | 83 | # Ideas |
63 | 84 | Counters for started, finished, etc on log jobs |
64 | 85 | Janitor actually deletes states marked as `is_deleted` |
| 86 | + |
| 87 | +Move sources to their own table |
| 88 | + |
| 89 | + |
| 90 | +Harvesters |
| 91 | +Transformers |
| 92 | +Regulators |
| 93 | + |
| 94 | + |
| 95 | +Deduplicator |
| 96 | +Assembler |