Commit a022e52

Update README.md
1 parent 25fc5b3 commit a022e52

1 file changed

Lines changed: 42 additions & 10 deletions

whitepapers/README.md

@@ -1,4 +1,7 @@
1-
# The SHARE Pipeline
1+
# The SHARE System Spec (Code name: Tamandua)
2+
3+
This set of documents exists to both formalize and restructure how SHARE processes information.
4+
Over time, the "pipeline" has changed significantly since its inception. The vocabulary used to describe the process has not, leading to a fair amount of confusion.
25

36
## Vocabulary
47

@@ -7,21 +10,40 @@
710
* State -- Source specific version of a given SHARE Object.
811
* Final -- The representation of an object in the publicly accessible SHARE dataset.
912
* Harvest -- Collecting data from
10-
* Transform --
11-
* [Data] Cleaning --
1213
* Normalize -- Defunct; see Transform.
14+
* Harvester -- Code responsible for acquiring data to be processed by SHARE
15+
* Transformer -- Code responsible for reformatting harvested data into a SHARE-compliant format
16+
* Regulator -- Code responsible for cleaning and validating values emitted by a transformer
17+
* Deduplicator -- The job/code responsible for matching states together to be assembled into the final dataset
18+
* Assembler -- The code responsible for selecting the best attributes from individual states
19+
20+
### Why did you pick X word?
21+
* Harvest -- The meaning of this word never actually changed. Kept for historical reasons.
22+
* [Transform](http://www.dictionary.com/browse/transform) -- To change in form, appearance, or structure; metamorphose.
23+
* [Regulate](http://www.dictionary.com/browse/regulate) -- To control or direct by a rule, principle, method, etc.
24+
* [Deduplicate]() --
25+
* [Assemble](http://www.dictionary.com/browse/assemble) -- To put or fit together; put together the parts of.
1326

1427
## Overview
1528

16-
### Main pipeline:
29+
### There is no pipeline
30+
31+
```
32+
+--------------+ +--------------------------------------------+
33+
| Harvest Task | ---> | Transform ---> Regulate ---> Consolidate |
34+
+--------------+ +--------------------------------------------+
1735
18-
* Data Ingest Task -- Collect and store data
19-
* Data Process Task -- Parse and clean data
36+
+---------------+
37+
| Deduplication |
38+
+---------------+
2039
21-
### Auxillary Tasks
40+
+----------+
41+
| Assemble |
42+
+----------+
43+
```
2244

23-
* Data Disambiguate Task -- Link together individual records
24-
* Data Build Task -- Intellgently aggregate records describing the same object
45+
At first glance, the SHARE workflow may appear to be a pipeline. It is not. Data is only addressed as a single entity briefly, at the beginning of the workflow.
46+
Once data is fully processed, it is treated as part of a whole. All jobs, aside from the first two boxes, operate on the dataset as a whole.
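
The split between per-record work and whole-dataset work can be sketched roughly as follows. This is only an illustration, not SHARE's actual code: the `State` fields, every function signature, the identifier-based matching rule, and the "longest title wins" selection rule are all assumptions made for the example. The Consolidate step from the diagram is left out because this document does not define it.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Source-specific version of an object (hypothetical shape)."""
    source: str
    identifier: str
    title: str


def transform(raw: dict, source: str) -> State:
    # Per-record step: reformat harvested data into the assumed common shape.
    return State(source=source, identifier=raw["id"], title=raw.get("title", ""))


def regulate(state: State) -> State:
    # Per-record step: clean and validate the values emitted by the transformer.
    identifier = state.identifier.strip().lower()
    if not identifier:
        raise ValueError("state has no identifier")
    return State(state.source, identifier, " ".join(state.title.split()))


def deduplicate(states: list[State]) -> list[list[State]]:
    # Whole-dataset job: group states that describe the same object.
    # Matching on an identical identifier is an assumption made for brevity.
    groups: dict[str, list[State]] = defaultdict(list)
    for state in states:
        groups[state.identifier].append(state)
    return list(groups.values())


def assemble(group: list[State]) -> dict:
    # Whole-dataset job: pick the "best" attributes from the matched states.
    # "Longest title wins" is a stand-in rule, not SHARE's real policy.
    best = max(group, key=lambda s: len(s.title))
    return {
        "identifier": best.identifier,
        "title": best.title,
        "sources": sorted(s.source for s in group),
    }


# Made-up harvested records from two hypothetical sources.
harvested = [
    ("source_a", {"id": " DOI:10.1/abc ", "title": "A  study"}),
    ("source_b", {"id": "doi:10.1/ABC", "title": "A study of things"}),
    ("source_b", {"id": "doi:10.1/xyz", "title": "Other"}),
]

# Only here is each record handled on its own.
states = [regulate(transform(raw, source)) for source, raw in harvested]

# From here on, jobs see the dataset as a whole.
final = [assemble(group) for group in deduplicate(states)]
```

In this sketch, `transform` and `regulate` take one record at a time, while `deduplicate` and `assemble` only make sense when given every state at once, which is the reason the workflow is not a simple record-by-record pipeline.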
2547

2648
### Janitor
2749

@@ -54,11 +76,21 @@ The rebuilds will heal any issues in the final dataset
5476

5577

5678

57-
5879
Single global disambiguation task
5980
Store tags and subjects as arrays
6081
SUIDS as another table. Lock to prevent racing on rawdata updating
6182

6283
# Ideas
6384
Counters for started, finished, etc on log jobs
6485
Janitor actually deletes states marked as `is_deleted`
86+
87+
Move sources to their own table
88+
89+
90+
Harvesters
91+
Transformers
92+
Regulators
93+
94+
95+
Deduplicator
96+
Assembler
