A highly customizable event data generator, created by the team at Imply.
The data generator requires Python 3.
Create and activate a local virtual environment, then install dependencies:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Run the following example to test the generator script:

```
python generator.py -c presets/configs/ecommerce.json -t access_combined -m 1 -n 10
```

This command generates logs in Apache combined access log format. It uses a single worker to generate 10 records, and it outputs the results to the standard output stream, such as the terminal window. Status messages are written to stderr, so stdout contains only data and can be piped directly.
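The stream separation can be illustrated in miniature with Python (the record and status strings below are hypothetical, not the generator's actual output):

```python
import sys

record = '{"event": "pageview"}'   # data: goes to stdout, safe to pipe
status = "generated 1 record"      # status: goes to stderr, stays visible

print(record)                      # a consumer reading the pipe sees only this
print(status, file=sys.stderr)     # diagnostics never pollute the data stream
```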
For more examples and test cases, see test.sh.
The presets/ folder contains ready-to-use configs with embedded output templates — use -t to select an output format by name. See presets/README.md for details.
Building your own config? Start here:
- How to build a config — step-by-step from concept to tested config, with a worked example
- Common patterns — variable persistence, multi-record sessions, flow duration
- Best practices — naming conventions, the synthetic clock, common pitfalls
Reference — field-level lookup for all config options:
- States — all five state types and their fields
- Emitters — record output configuration
- Field generators — all field generator types
- Distributions — uniform, exponential, normal, gmm_temporal
- Templates — Jinja2 output templates
- Schedules — time-of-day traffic variation
- Deterministic output — reproducible generation with --seed
Run the generator.py script from the command line with Python.
```
python generator.py \
  -c <generator configuration file> \
  -t <template name> \
  -f <format file> \
  -s <start timestamp> \
  -m <generator workers limit> \
  -n <record limit> \
  -r <duration limit in ISO8601 format> \
  --schedule <schedule file> \
  --debug \
  --seed <integer>
```

| Argument | Description |
|---|---|
| -c | Path to the generator configuration JSON file. See the generator configuration reference. |
| -t / --template | A named output template embedded in the generator config. See output templates. |
| -s | Use a simulated clock starting at the specified ISO time, rather than the system clock. This causes records to be produced instantaneously (batch) rather than with a real clock (real-time). |
| -m | The maximum number of workers to create. Defaults to 100. |
| -n | The number of records to generate. Must not be used in combination with -r. |
| -r | The length of time to create records for, expressed in ISO 8601 duration format. Must not be used in combination with -n. |
| --schedule | A JSON file that modulates the number of active workers over time, producing time-of-day traffic variation. See the schedule documentation for available schedules and how to write your own. |
| --debug | Enable debug logging. Outputs detailed thread scheduling and event queue information to stderr. |
| --seed | An integer seed for deterministic data generation. Use with -s for fully reproducible output. |
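The guarantee behind --seed can be sketched with a seeded RNG in Python (this shows the general principle, not the generator's internal random-number handling):

```python
import random

def draw(seed, n=5):
    # One Random instance seeded once, as a deterministic generator would use.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

same_a, same_b = draw(42), draw(42)   # identical seed, identical sequence
```

Combine --seed with -s so timestamps come from the synthetic clock as well; with a real clock, wall-clock time still varies between runs.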
The generator configuration is a JSON document passed via -c. It contains two top-level arrays:
```
{
  "states": [ ... ],
  "emitters": [ ... ]
}
```

- A list of states that each worker traverses. The first state controls interarrival pacing; subsequent states set variables, emit records, route between paths, and terminate.
- A list of emitters that define the output record shape. Each dimension uses a field generator to produce values, controlled by distributions.
Each concurrent worker (-m) runs one independent Actor — one lifecycle from the initial event:start:timer to event:end. For the full design process, see how to build a config.
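That lifecycle can be sketched as a tiny state machine (a conceptual illustration only; the state names and the emit/next fields here are invented, not the real config schema):

```python
def run_actor(states, initial="start"):
    """Walk one actor from its initial state until a state with no successor."""
    current, emitted = initial, []
    while current is not None:
        state = states[current]
        emitted.extend(state.get("emit", []))
        current = state.get("next")    # no successor ends the lifecycle
    return emitted

# One worker: pace at "start", emit a record while browsing, then terminate.
lifecycle = {
    "start":  {"next": "browse"},
    "browse": {"emit": ["pageview"], "next": "end"},
    "end":    {},
}
```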
Configs that include a templates block (such as those in presets/configs/) support named output templates selected with --template. Templates use Jinja2 and can produce JSON, CSV, NCSA combined logs, and more from a single config. See the output templates reference.
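The one-config, many-formats idea looks like this in miniature, with the stdlib's string.Template standing in for Jinja2 (the field names are illustrative):

```python
import json
from string import Template

record = {"ip": "203.0.113.7", "path": "/cart", "status": 200}

# The same record rendered two ways: as JSON and as an NCSA-style log line.
as_json = json.dumps(record)
as_log = Template('$ip - - "GET $path HTTP/1.1" $status').substitute(record)
```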
Use -n to stop after a number of records, or -r to stop after a duration (ISO 8601). If neither is set, the generator runs indefinitely.
```
# 1000 records
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -n 1000

# One hour of data
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -r PT1H
```

By default, timestamps reflect the real system clock. Use -s to start a synthetic clock at a fixed point in time — records are produced instantly rather than in real time, which is recommended for generating large volumes of historical data.
```
# 1000 records starting 1 Jan 2025
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -n 1000 -s "2025-01-01T00:00"

# One hour of data starting 1 Jan 2025
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -r PT1H -s "2025-01-01T00:00"
```

The generator always writes to stdout. Pipe it to whatever destination you need.
The default — useful for inspection or piping to other tools:
```
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -n 100
```

Redirect stdout to a file:

```
python generator.py -c presets/configs/ecommerce.json -t apache:access:json -n 1000 > events.json
```

Pipe to kcat:
```
python generator.py -c presets/configs/ecommerce.json -t apache:access:json \
  | kcat -b localhost:9092 -t my-topic
```

Use kcat with SASL authentication:
```
python generator.py -c presets/configs/ecommerce.json -t apache:access:json \
  | kcat -b pkc-example.us-east-1.aws.confluent.cloud:9092 \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=PLAIN \
    -X sasl.username="$CONFLUENT_API_KEY" \
    -X sasl.password="$CONFLUENT_API_SECRET" \
    -t my-topic
```

When the endpoint is able to apply metadata (e.g. sourcetype, index, and host), pipe to services/collector/raw:
```
python generator.py -c presets/configs/ecommerce.json -t access_combined \
  | curl -s -X POST https://hec.example.com/services/collector/raw \
    -H "Authorization: Splunk $HEC_TOKEN" \
    --data-binary @-
```

For full control over metadata, use a pipeline tool that wraps each event in a HEC envelope — an OTel Collector with a Splunk HEC exporter, or Cribl or Vector.
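A minimal sketch of that envelope, using Splunk's documented HEC event keys (the sourcetype and index values here are illustrative):

```python
import json

def hec_envelope(raw_line, sourcetype="access_combined", index="main"):
    """Wrap one generator output line in a Splunk HEC event envelope."""
    return json.dumps({
        "sourcetype": sourcetype,
        "index": index,
        "event": raw_line,     # the raw record, carried verbatim
    })
```

Each wrapped line is then typically POSTed to the services/collector/event endpoint rather than the raw one.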