This project simulates multi-channel sensor data and stores it as Parquet files for later analysis.
Each generated file contains multiple rows of sensor readings, where each row includes:
- Timestamp: UTC timestamp of the reading
- 5 channels: `channel_0` through `channel_4`, containing random float values
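A single reading of this shape can be sketched in a few lines (an illustrative example only, not the simulator's actual code; the field names follow the schema above):

```python
from datetime import datetime, timezone

import numpy as np

# One simulated reading: a UTC timestamp plus five random float channels.
# This mirrors the documented schema; it is not the simulator's own code.
reading = {
    "Timestamp": datetime.now(timezone.utc),
    **{f"channel_{i}": float(np.random.random()) for i in range(5)},
}
```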
The simulator is controlled by the following parameters:
- `--frequency-hertz` (default: `1`): Sampling frequency in Hertz (positive integer)
- `--sending-rate-seconds` (default: `30`): Sending rate in seconds (positive integer)
- `--n-files` (default: `None`): Number of files to generate (`None` = infinite)
- `--directory` (default: `./simulated_data`): Output directory for Parquet files
- `--realtime`/`--no-realtime` (default: realtime): Enable or disable real-time simulation with delays
The number of entries per file is derived automatically as:
```
messages_per_file = frequency_hertz * sending_rate_seconds
```
By default, the simulator generates files with 30 messages at 1 Hz, which equals a 30-second collection window.
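The relationship between the parameters and the file size can be checked with a small helper (a hypothetical function for illustration, not part of the project's API):

```python
def messages_per_file(frequency_hertz: int, sending_rate_seconds: int) -> int:
    """Number of entries written per file (illustrative helper, not project code)."""
    if frequency_hertz <= 0 or sending_rate_seconds <= 0:
        raise ValueError("both parameters must be positive integers")
    return frequency_hertz * sending_rate_seconds

# Defaults: 1 Hz over a 30-second window -> 30 messages per file.
assert messages_per_file(1, 30) == 30
# The 2 Hz example from the options reference: 60 messages per file.
assert messages_per_file(2, 30) == 60
```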
Build the image from the repository root:
```
docker build -t sensor-data-simulate .
```

Show CLI help via Docker:

```
docker run --rm sensor-data-simulate --help
```

Run a finite simulation, storing sensor data in a host directory:
On Windows PowerShell:

```powershell
mkdir simulated_data
docker run --rm `
  -v "${PWD}/simulated_data:/app/simulated_data" `
  sensor-data-simulate `
  --n-files 10 `
  --frequency-hertz 2
```

On Linux/macOS:

```bash
mkdir -p simulated_data
docker run --rm \
  -v "$(pwd)/simulated_data:/app/simulated_data" \
  sensor-data-simulate \
  --n-files 10 \
  --frequency-hertz 2
```

Run an infinite, real-time simulation and stop it with Ctrl+C or a
SIGTERM (same volume mapping applies, only the arguments change), e.g. on
Linux/macOS:
```bash
docker run --rm \
  -v "$(pwd)/simulated_data:/app/simulated_data" \
  sensor-data-simulate --realtime
```

The container's entrypoint is `sensor-data-simulate`, so any additional arguments
after the image name are passed directly to the simulator.
This project uses Poetry.
1. Make sure Poetry is installed.
2. From the repository root, install dependencies:

   ```
   poetry install
   ```

3. Activate the environment when running commands:

   ```
   poetry run sensor-data-simulate --help
   ```
If you prefer not to use Poetry, you can instead create a virtual environment of your choice and install the project with pip from the repository root (for example on Windows PowerShell):
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install .
sensor-data-simulate --help
```

The CLI is exposed via the `sensor-data-simulate` script (configured in
`[tool.poetry.scripts]` in `pyproject.toml`).
Basic help:
```
poetry run sensor-data-simulate --help
```

- `--directory PATH`
  - Directory where simulated Parquet files are written.
  - Default: `./simulated_data` (created if it does not exist).
- `--frequency-hertz INT`
  - Sampling frequency of the generated sensor readings in Hz.
  - Default: `1`. Must be a positive integer.
- `--sending-rate-seconds INT`
  - Time interval between file writes in seconds.
  - Default: `30`. Must be a positive integer.
- `messages-per-file` (derived)
  - Computed as `frequency-hertz * sending-rate-seconds`.
  - Example: 2 Hz * 30 s = 60 entries per file.
- `--n-files INT`
  - Total number of files to generate.
  - Default: `None` → run indefinitely until interrupted with `Ctrl+C`.
- `--realtime` / `--no-realtime`
  - Default behavior is realtime mode.
  - Use `--no-realtime` to write files as fast as possible.
- `--log-level LEVEL`
  - Logging verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`.
  - Default: `INFO`.
Generate 100 files with default settings (30 messages at 1 Hz each):

```
poetry run sensor-data-simulate --n-files 100
```

Generate 50 files with 2 Hz sampling frequency and 60 messages per file:

```
poetry run sensor-data-simulate --n-files 50 --frequency-hertz 2
```

Generate files indefinitely with real-time delays (30 seconds between files):

```
poetry run sensor-data-simulate
```

Generate data to a custom directory:

```
poetry run sensor-data-simulate --directory data/sensor_logs --n-files 10
```

Stop an infinite simulation with Ctrl+C.
The simulator writes Parquet files to the target directory. Filenames follow the pattern:

```
sensor_data_{YYYYMMDD}T{HHMMSS}_{uuid}.parquet
```
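A filename matching this pattern can be produced like so (an illustrative sketch, not the simulator's own implementation; in particular, rendering the `{uuid}` segment as 32 hex characters is an assumption):

```python
import re
import uuid
from datetime import datetime, timezone

def make_filename(now=None):
    """Build a name following sensor_data_{YYYYMMDD}T{HHMMSS}_{uuid}.parquet."""
    now = now or datetime.now(timezone.utc)
    return f"sensor_data_{now:%Y%m%d}T{now:%H%M%S}_{uuid.uuid4().hex}.parquet"

name = make_filename()
# The result should match the documented pattern.
assert re.fullmatch(r"sensor_data_\d{8}T\d{6}_[0-9a-f]{32}\.parquet", name)
```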
Each file contains multiple rows of sensor readings with the following columns:
- `Timestamp` – `pandas.Timestamp` (UTC) of the reading.
- `channel_0` – float, random sensor value for channel 0.
- `channel_1` – float, random sensor value for channel 1.
- `channel_2` – float, random sensor value for channel 2.
- `channel_3` – float, random sensor value for channel 3.
- `channel_4` – float, random sensor value for channel 4.
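For reference, a frame with this schema can be constructed directly (an illustrative sketch only; real files are produced by the simulator, and the start timestamp below is arbitrary):

```python
import numpy as np
import pandas as pd

# Build a small frame mirroring the documented schema: one UTC timestamp
# column plus five random float channels. Not the simulator's own code.
n_rows = 30  # default messages_per_file: 1 Hz over a 30-second window
df = pd.DataFrame({
    "Timestamp": pd.date_range("2026-04-16T15:37:07", periods=n_rows, freq="1s", tz="UTC"),
    **{f"channel_{i}": np.random.random(n_rows) for i in range(5)},
})
```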
You can load and analyze the data with pandas:
```python
import pandas as pd

df = pd.read_parquet("simulated_data/sensor_data_20260416T153707_....parquet")
timestamp = df.loc[0, "Timestamp"]
channel_0 = df.loc[0, "channel_0"]
```

For downstream analytics or stream-processing pipelines, you can
summarize each sensor channel into a small set of statistics using
`extract_channel_statistics` from `mkp.sensor_data.simulate.features`:
```python
from mkp.sensor_data.simulate.features import extract_channel_statistics

features = extract_channel_statistics(df["channel_0"].to_numpy())
```

The returned dictionary contains the mean, standard deviation, minimum, and maximum values and is suitable for feeding into streaming pipelines or online monitoring dashboards.
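For comparison, the same four statistics can be computed directly with NumPy (a sketch; the key names below are assumptions and may differ from the keys actually returned by `extract_channel_statistics`):

```python
import numpy as np

def channel_statistics(values: np.ndarray) -> dict:
    """Summarize one channel into basic statistics.

    Illustrative stand-in for the project's feature extractor; the key
    names here are assumptions, not the library's documented keys.
    """
    return {
        "mean": float(np.mean(values)),
        "std": float(np.std(values)),   # population std (ddof=0)
        "min": float(np.min(values)),
        "max": float(np.max(values)),
    }

stats = channel_statistics(np.array([1.0, 2.0, 3.0, 4.0]))
# mean = 2.5, min = 1.0, max = 4.0
```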
- Format/lint checks are configured via Ruff.
- Tests are run with pytest:

  ```
  poetry run pytest
  ```

Adjust or extend the simulator logic in `mkp/sensor_data/simulate/simulate.py` and
the CLI in `mkp/sensor_data/simulate/main.py`.