Commit 8ef1f8e

aaxelb authored and chrisseto committed

[Tamandua] Describe models for configuring sources (#588)

* Describe models for configuring sources
* SourceConfig => IngestConfig
* Consistent table definitions
* Define tables using tables.
* Fix some table stuff.
* Updates

1 parent a022e52

3 files changed

Lines changed: 141 additions & 31 deletions

whitepapers/Tables.md

Lines changed: 110 additions & 0 deletions

@@ -1,5 +1,115 @@
# SQL Tables

## Template

### {ModelName}
{Description}

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| | | | | | | |

#### Other indices
* `{column_name}`, `{column_name}`, ... [(unique)]
* ...
## Data

### SourceUniqueIdentifier (SUID)
Identifier for a specific document from a specific source.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `identifier` | text | | | | | Identifier given to the document by the source |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document |

#### Other indices
* `identifier`, `ingest_config_id` (unique)

### RawData
Raw data, exactly as it was given to SHARE.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `suid_id` | int | | | ✓ | | SUID for this datum |
| `data` | text | | | | | The raw data itself (typically a JSON or XML string) |
| `sha256` | text | unique | | | | SHA-256 hash of `data` |
| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum |
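The two tables above could be sketched in SQLite DDL as follows. This is illustrative only: the real schema is managed by Django (on Postgres), and the table and column names here simply follow the whitepaper.

```python
import hashlib
import sqlite3

# Illustrative sketch of the SUID and RawData tables above (the real schema
# is Django/Postgres; names here follow the whitepaper).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_unique_identifier (
    id INTEGER PRIMARY KEY,
    identifier TEXT NOT NULL,           -- identifier given by the source
    ingest_config_id INTEGER NOT NULL,  -- FK -> ingest_config
    UNIQUE (identifier, ingest_config_id)
);
CREATE TABLE raw_data (
    id INTEGER PRIMARY KEY,
    suid_id INTEGER NOT NULL REFERENCES source_unique_identifier (id),
    data TEXT NOT NULL,
    sha256 TEXT NOT NULL UNIQUE         -- dedupes identical harvested payloads
);
""")

def store_raw(conn, suid_id, data):
    """Get-or-create a RawData row keyed by the SHA-256 of its payload."""
    digest = hashlib.sha256(data.encode()).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO raw_data (suid_id, data, sha256) VALUES (?, ?, ?)",
        (suid_id, data, digest),
    )
    return digest
```

The unique `sha256` index is what makes re-harvesting cheap: an identical payload inserts nothing new.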
## Ingest Configuration

### IngestConfig
Describes one way to harvest metadata from a Source, and how to transform the result.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `source_id` | int | | | ✓ | | Source to harvest from |
| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available |
| `earliest_date` | date | | ✓ | | | Earliest date with available data |
| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds |
| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests |
| `harvester_id` | int | | | ✓ | | Harvester to use |
| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs |
| `transformer_id` | int | | | ✓ | | Transformer to use |
| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data |
| `disabled` | bool | | | | False | True if this ingest config should not be run automatically |
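`rate_limit_allowance` and `rate_limit_period` together describe an "N requests per M seconds" budget. A minimal sketch of such a limiter (a hypothetical helper, not SHARE code) might look like:

```python
import time

class RateLimiter:
    """Allow `allowance` requests per `period` seconds.

    Defaults match the IngestConfig defaults above: 5 requests per 1 second.
    """

    def __init__(self, allowance=5, period=1):
        self.allowance = allowance
        self.period = period
        self.stamps = []  # times of recent requests

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop request timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.period]
        if len(self.stamps) >= self.allowance:
            return False  # budget exhausted; caller should sleep and retry
        self.stamps.append(now)
        return True
```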
### Source
A Source is a place metadata comes from.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `name` | text | unique | | | | Short name |
| `long_title` | text | unique | | | | Full, human-friendly name |
| `home_page` | text | | ✓ | | | URL |
| `icon` | image | | ✓ | | | Recognizable icon for the source |
| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) |
### Harvester
Each row corresponds to a Harvester implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |

### Transformer
Each row corresponds to a Transformer implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |
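The unique `key` column implies a registry mapping each row's key to the corresponding Python subclass. A hypothetical sketch of that pattern (class names and keys here are invented for illustration):

```python
# Hypothetical registry pattern implied by the Harvester/Transformer `key`
# column: each row's unique key resolves to a Python subclass.
REGISTRY = {}

def register(key):
    """Class decorator that files a subclass under its key."""
    def decorator(cls):
        REGISTRY[key] = cls
        return cls
    return decorator

class BaseHarvester:
    def harvest(self, start_date, end_date):
        raise NotImplementedError

@register("org.example")                 # key is illustrative
class ExampleHarvester(BaseHarvester):
    def harvest(self, start_date, end_date):
        return []

def get_harvester(key):
    """Resolve a Harvester row's `key` to its implementation class."""
    return REGISTRY[key]
```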
## Logs

### HarvestLog
Log entries to track the status of a specific harvester run.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} |

#### Other indices
* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique)

### TransformLog
Log entries to track the status of a transform task.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |

#### Other indices
* `raw_id`, `transformer_version` (unique)
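Both log tables store versions with each segment zero-padded so that plain string comparison sorts versions in true semantic order. The padding rule described above can be sketched as:

```python
def pad_version(version):
    """'1.2.10' => '001.002.010', so lexicographic string comparison sorts
    versions in semantic-version order (the format the log tables describe)."""
    return ".".join(segment.zfill(3) for segment in version.split("."))
```

Without the padding, `'1.10.0' < '1.9.0'` as strings; with it, `'001.010.000' > '001.009.000'` as intended.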

whitepapers/tasks/Harvest.md

Lines changed: 15 additions & 15 deletions

@@ -17,50 +17,50 @@
## Parameters
* `ingest_config_id` -- The PK of the IngestConfig to use
* `start_date` --
* `end_date` --
* `limit` -- The maximum number of documents to collect. Defaults to `None` (unlimited)
* `superfluous` -- Take certain actions that have previously succeeded
* `transform` -- Should TransformJobs be launched for collected data? Defaults to `True`
* `no_split` -- Should harvest jobs be split into multiple jobs? Defaults to `False`
* `ignore_disabled` -- Run the task even with disabled ingest configs
* `force` -- Force the task to run, against all odds

## Steps

### Preventative measures
* If the specified `ingest_config` is disabled and neither `force` nor `ignore_disabled` is set, crash
* For the given `ingest_config`, find up to the last 5 harvest jobs with the same harvester versions
  * If they all failed, throw an exception (refuse to run)

### Setup
* Lock the `ingest_config` (NOWAIT)
  * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`)
  * If found and its status is:
    * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
    * `STARTED`: log a warning (it should not have been possible to lock the `ingest_config`) and update timestamps and/or counts.
* Set HarvestLog status to `STARTED`
* If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False
  * Chunk the date range and spawn a harvest task for each chunk
  * Set status to `SPLIT` and exit
* Load the harvester for the given `ingest_config`
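The split step above chunks an oversized date range into per-chunk harvest tasks. A sketch, assuming a fixed chunk length since [SOME LENGTH OF TIME] is not pinned down in this whitepaper:

```python
from datetime import datetime, timedelta

def chunk_date_range(start, end, chunk=timedelta(days=1)):
    """Yield (start, end) pairs covering [start, end) in chunk-sized pieces.

    The harvest task would spawn one sub-task per pair, then mark itself
    SPLIT. The 1-day default is an assumption for illustration.
    """
    cursor = start
    while cursor < end:
        upper = min(cursor + chunk, end)
        yield cursor, upper
        cursor = upper
```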

### Actually Harvest
* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit`

### Pass the data along
* Begin catching any exceptions
* For each piece of data received (preferably in bulk/chunks)
  * Get or create `SourceUniqueIdentifier(suid, source_id)`
    * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
  * Get or create RawData(hash, suid)
* For each piece of data (after saving, to keep things as transactional as possible)
  * Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)`
    * If the log already exists and `superfluous` is not set, exit
  * Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False`
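The get-or-create flow in this section can be sketched with an in-memory dict standing in for the TransformLog table (the real code would use the database; `pass_along` is a hypothetical helper name):

```python
# In-memory sketch of the get-or-create flow above; a dict stands in for
# the TransformLog table keyed on (raw_id, ingest_config_id, version).
transform_logs = {}

def pass_along(raw_id, ingest_config_id, transformer_version, superfluous=False):
    """Return True if a transform task should be started for this datum."""
    key = (raw_id, ingest_config_id, transformer_version)
    if key in transform_logs and not superfluous:
        return False  # log already exists; skip unless superfluous is set
    transform_logs.setdefault(key, {"status": "INITIAL"})
    return True
```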

### Clean up
* If an exception was caught, set status to `FAILED` and insert the exception/traceback

whitepapers/tasks/Transform.md

Lines changed: 16 additions & 16 deletions

@@ -2,15 +2,15 @@
## Responsibilities
* Parsing data using source-specific parsers
* Applying global cleaners to the data
* Catching any extraneous exceptions, storing them in the TransformLog, and marking the TransformLog `FAILED`

## Parameters
* `raw_id` --
* `transformer_version` --
* `regulator_version` --
* `superfluous` --

@@ -19,23 +19,23 @@
### Setup
* Load RawData by id.
  * Crash if not found.
* If not defined, set `transformer_version` to the latest.
* If not defined, set `regulator_version` to the latest.
* Find and lock TransformLog(`raw_id`, `transformer_version`) (SELECT FOR UPDATE NOWAIT)
  * If not found, log an error. Create, commit, lock.
    * If the create fails, log an error and exit.
  * If the lock times out or isn't granted, log an error and exit.
* If the found TransformLog's status is `SUCCEEDED` and `superfluous` is `False`, exit.
* Set the status of the TransformLog to `STARTED`

### Check for racing
* Search for any equivalent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished transforming
  * If found, set status to `RESCHEDULED` and exit

### Actually transform the data
* Start a transaction
* Load the transformer
* Transform data
* Load the cleaning suite
* Clean data
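The transform-then-clean pipeline above is a simple composition: run the source-specific transformer, then each global cleaner in order. A minimal sketch, where the transformer and cleaner below are invented stand-ins for the real implementations:

```python
def transform_and_clean(raw_datum, transformer, cleaners):
    """Run the source-specific transformer, then each global cleaner in order."""
    result = transformer(raw_datum)
    for cleaner in cleaners:
        result = cleaner(result)
    return result

def parse_title(raw):       # stand-in "transformer" for illustration
    return {"title": raw["t"]}

def strip_whitespace(doc):  # stand-in global "cleaner" for illustration
    return {k: v.strip() for k, v in doc.items()}
```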

@@ -52,4 +52,4 @@
* Commit transaction
* Release all locks
* Start disambiguation tasks for updated states
* Set TransformLog status to `SUCCEEDED`
