Commit 8ef1f8e

aaxelb authored and chrisseto committed

[Tamandua] Describe models for configuring sources (#588)

* Describe models for configuring sources
* SourceConfig => IngestConfig
* Consistent table definitions
* Define tables using tables.
* Fix some table stuff.
* Updates

1 parent a022e52

3 files changed

Lines changed: 141 additions & 31 deletions

whitepapers/Tables.md

Lines changed: 110 additions & 0 deletions

@@ -1,5 +1,115 @@
# SQL Tables

## Template

### {ModelName}
{Description}

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| | | | | | | |

#### Other indices
* `{column_name}`, `{column_name}`, ... [(unique)]
* ...
## Data

### SourceUniqueIdentifier (SUID)
Identifier for a specific document from a specific source.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `identifier` | text | | | | | Identifier given to the document by the source |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document |

#### Other indices
* `identifier`, `ingest_config_id` (unique)

### RawData
Raw data, exactly as it was given to SHARE.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `suid_id` | int | | | ✓ | | SUID for this datum |
| `data` | text | | | | | The raw data itself (typically a JSON or XML string) |
| `sha256` | text | unique | | | | SHA-256 hash of `data` |
| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum |
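The two tables above could be sketched in SQLite DDL as follows. This is illustrative only: the real schema is managed by Django (on Postgres), and the table and column names here simply follow the whitepaper.

```python
import hashlib
import sqlite3

# Illustrative sketch of the SUID and RawData tables above (the real schema
# is Django/Postgres; names here follow the whitepaper).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_unique_identifier (
    id INTEGER PRIMARY KEY,
    identifier TEXT NOT NULL,           -- identifier given by the source
    ingest_config_id INTEGER NOT NULL,  -- FK -> ingest_config
    UNIQUE (identifier, ingest_config_id)
);
CREATE TABLE raw_data (
    id INTEGER PRIMARY KEY,
    suid_id INTEGER NOT NULL REFERENCES source_unique_identifier (id),
    data TEXT NOT NULL,
    sha256 TEXT NOT NULL UNIQUE         -- dedupes identical harvested payloads
);
""")

def store_raw(conn, suid_id, data):
    """Get-or-create a RawData row keyed by the SHA-256 of its payload."""
    digest = hashlib.sha256(data.encode()).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO raw_data (suid_id, data, sha256) VALUES (?, ?, ?)",
        (suid_id, data, digest),
    )
    return digest
```

The unique `sha256` index is what makes re-harvesting cheap: an identical payload inserts nothing new.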
## Ingest Configuration

### IngestConfig
Describes one way to harvest metadata from a Source, and how to transform the result.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `source_id` | int | | | ✓ | | Source to harvest from |
| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available |
| `earliest_date` | date | | ✓ | | | Earliest date with available data |
| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds |
| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests |
| `harvester_id` | int | | | ✓ | | Harvester to use |
| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs |
| `transformer_id` | int | | | ✓ | | Transformer to use |
| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data |
| `disabled` | bool | | | | False | True if this ingest config should not be run automatically |
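`rate_limit_allowance` and `rate_limit_period` together describe an "N requests per M seconds" budget. A minimal sketch of such a limiter (a hypothetical helper, not SHARE code) might look like:

```python
import time

class RateLimiter:
    """Allow `allowance` requests per `period` seconds.

    Defaults match the IngestConfig defaults above: 5 requests per 1 second.
    """

    def __init__(self, allowance=5, period=1):
        self.allowance = allowance
        self.period = period
        self.stamps = []  # times of recent requests

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop request timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.period]
        if len(self.stamps) >= self.allowance:
            return False  # budget exhausted; caller should sleep and retry
        self.stamps.append(now)
        return True
```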
### Source
A Source is a place metadata comes from.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `name` | text | unique | | | | Short name |
| `long_title` | text | unique | | | | Full, human-friendly name |
| `home_page` | text | | ✓ | | | URL |
| `icon` | image | | ✓ | | | Recognizable icon for the source |
| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) |
### Harvester
Each row corresponds to a Harvester implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |

### Transformer
Each row corresponds to a Transformer implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |
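The unique `key` column implies a registry mapping each row's key to the corresponding Python subclass. A hypothetical sketch of that pattern (class names and keys here are invented for illustration):

```python
# Hypothetical registry pattern implied by the Harvester/Transformer `key`
# column: each row's unique key resolves to a Python subclass.
REGISTRY = {}

def register(key):
    """Class decorator that files a subclass under its key."""
    def decorator(cls):
        REGISTRY[key] = cls
        return cls
    return decorator

class BaseHarvester:
    def harvest(self, start_date, end_date):
        raise NotImplementedError

@register("org.example")                 # key is illustrative
class ExampleHarvester(BaseHarvester):
    def harvest(self, start_date, end_date):
        return []

def get_harvester(key):
    """Resolve a Harvester row's `key` to its implementation class."""
    return REGISTRY[key]
```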
## Logs

### HarvestLog
Log entries to track the status of a specific harvester run.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} |

#### Other indices
* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique)

### TransformLog
Log entries to track the status of a transform task.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |

#### Other indices
* `raw_id`, `transformer_version` (unique)
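Both log tables store versions with each segment zero-padded so that plain string comparison sorts versions in true semantic order. The padding rule described above can be sketched as:

```python
def pad_version(version):
    """'1.2.10' => '001.002.010', so lexicographic string comparison sorts
    versions in semantic-version order (the format the log tables describe)."""
    return ".".join(segment.zfill(3) for segment in version.split("."))
```

Without the padding, `'1.10.0' < '1.9.0'` as strings; with it, `'001.010.000' > '001.009.000'` as intended.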

whitepapers/tasks/Harvest.md

Lines changed: 15 additions & 15 deletions

@@ -17,50 +17,50 @@
## Parameters
* `ingest_config_id` -- The PK of the IngestConfig to use
* `start_date` --
* `end_date` --
* `limit` -- The maximum number of documents to collect. Defaults to `None` (unlimited)
* `superfluous` -- Take certain actions that have previously succeeded
* `transform` -- Should TransformJobs be launched for collected data? Defaults to `True`
* `no_split` -- Should harvest jobs be split into multiple jobs? Defaults to `False`
* `ignore_disabled` -- Run the task even with disabled ingest configs
* `force` -- Force the task to run, against all odds

## Steps

### Preventative measures
* If the specified `ingest_config` is disabled and neither `force` nor `ignore_disabled` is set, crash
* For the given `ingest_config`, find up to the last 5 harvest jobs with the same harvester versions
  * If they all failed, throw an exception (refuse to run)

### Setup
* Lock the `ingest_config` (NOWAIT)
  * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`)
  * If found and its status is:
    * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
    * `STARTED`: log a warning (it should not have been possible to lock the `ingest_config`) and update timestamps and/or counts.
* Set HarvestLog status to `STARTED`
* If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False
  * Chunk the date range and spawn a harvest task for each chunk
  * Set status to `SPLIT` and exit
* Load the harvester for the given `ingest_config`
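The split step above chunks an oversized date range into per-chunk harvest tasks. A sketch, assuming a fixed chunk length since [SOME LENGTH OF TIME] is not pinned down in this whitepaper:

```python
from datetime import datetime, timedelta

def chunk_date_range(start, end, chunk=timedelta(days=1)):
    """Yield (start, end) pairs covering [start, end) in chunk-sized pieces.

    The harvest task would spawn one sub-task per pair, then mark itself
    SPLIT. The 1-day default is an assumption for illustration.
    """
    cursor = start
    while cursor < end:
        upper = min(cursor + chunk, end)
        yield cursor, upper
        cursor = upper
```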

### Actually Harvest
* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit`

### Pass the data along
* Begin catching any exceptions
* For each piece of data received (preferably in bulk/chunks)
  * Get or create `SourceUniqueIdentifier(suid, source_id)`
    * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
  * Get or create RawData(hash, suid)
* For each piece of data (after saving, to keep things as transactional as possible)
  * Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)`
    * If the log already exists and `superfluous` is not set, exit
  * Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False`
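The get-or-create flow in this section can be sketched with an in-memory dict standing in for the TransformLog table (the real code would use the database; `pass_along` is a hypothetical helper name):

```python
# In-memory sketch of the get-or-create flow above; a dict stands in for
# the TransformLog table keyed on (raw_id, ingest_config_id, version).
transform_logs = {}

def pass_along(raw_id, ingest_config_id, transformer_version, superfluous=False):
    """Return True if a transform task should be started for this datum."""
    key = (raw_id, ingest_config_id, transformer_version)
    if key in transform_logs and not superfluous:
        return False  # log already exists; skip unless superfluous is set
    transform_logs.setdefault(key, {"status": "INITIAL"})
    return True
```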

### Clean up
* If an exception was caught, set status to `FAILED` and insert the exception/traceback

whitepapers/tasks/Transform.md

Lines changed: 16 additions & 16 deletions

@@ -2,15 +2,15 @@
## Responsibilities
* Parsing data using source-specific parsers
* Applying global cleaners to the data
* Catching any extraneous exceptions, storing them in the TransformLog, and marking the TransformLog `FAILED`

## Parameters
* `raw_id` --
* `transformer_version` --
* `regulator_version` --
* `superfluous` --

@@ -19,23 +19,23 @@
### Setup
* Load RawData by id.
  * Crash if not found.
* If not defined, set `transformer_version` to the latest.
* If not defined, set `regulator_version` to the latest.
* Find and lock TransformLog(`raw_id`, `transformer_version`) (SELECT FOR UPDATE NOWAIT)
  * If not found, log an error. Create, commit, lock.
    * If the create fails, log an error and exit.
  * If the lock times out or isn't granted, log an error and exit.
* If the found TransformLog's status is `SUCCEEDED` and `superfluous` is `False`, exit.
* Set the status of the TransformLog to `STARTED`

### Check for racing
* Search for any equivalent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished transforming
  * If found, set status to `RESCHEDULED` and exit

### Actually transform the data
* Start a transaction
* Load the transformer
* Transform data
* Load the cleaning suite
* Clean data
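The transform-then-clean pipeline above is a simple composition: run the source-specific transformer, then each global cleaner in order. A minimal sketch, where the transformer and cleaner below are invented stand-ins for the real implementations:

```python
def transform_and_clean(raw_datum, transformer, cleaners):
    """Run the source-specific transformer, then each global cleaner in order."""
    result = transformer(raw_datum)
    for cleaner in cleaners:
        result = cleaner(result)
    return result

def parse_title(raw):       # stand-in "transformer" for illustration
    return {"title": raw["t"]}

def strip_whitespace(doc):  # stand-in global "cleaner" for illustration
    return {k: v.strip() for k, v in doc.items()}
```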

@@ -52,4 +52,4 @@
* Commit transaction
* Release all locks
* Start disambiguation tasks for updated states
* Set TransformLog status to `SUCCEEDED`
