A distributed SQL query engine with columnar storage and parallel query execution. This project demonstrates a coordinator-worker architecture capable of parsing, planning, distributing, executing and aggregating SQL queries across multiple nodes.
- Columnar storage for efficient reading of large datasets.
- Supports SELECT, WHERE, aggregates (SUM, COUNT, AVG, MIN, MAX), and GROUP BY.
- Query planner for distributing tasks among workers.
- Fault tolerance with automatic redistribution of failed workers' segments.
- Aggregation of partial results from workers.
- Extensible design for future SQL features.
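The partial-result aggregation listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `merge_partials`, `finalize_avg`, and the dict-based partial format are hypothetical names chosen for this example. The main point is that AVG cannot be merged by averaging per-worker averages, so each worker is assumed to return SUM and COUNT, and AVG is derived only after merging.

```python
from typing import Dict, Iterable, Optional

# Hypothetical partial-result format: for each group key, a worker returns
# "sum", "count", "min" and "max" computed over its own segments.
Partial = Dict[str, Dict[str, float]]


def merge_partials(partials: Iterable[Partial]) -> Partial:
    """Combine per-worker partial aggregates into one entry per group."""
    merged: Partial = {}
    for partial in partials:
        for group, agg in partial.items():
            if group not in merged:
                merged[group] = dict(agg)  # copy so inputs are not mutated
                continue
            out = merged[group]
            out["sum"] += agg["sum"]
            out["count"] += agg["count"]
            out["min"] = min(out["min"], agg["min"])
            out["max"] = max(out["max"], agg["max"])
    return merged


def finalize_avg(agg: Dict[str, float]) -> Optional[float]:
    """Derive AVG from the merged SUM and COUNT (None for an empty group)."""
    return agg["sum"] / agg["count"] if agg["count"] else None


# Two workers report partial results for a GROUP BY on a region column.
worker_1 = {"EU": {"sum": 30.0, "count": 3, "min": 5.0, "max": 15.0}}
worker_2 = {
    "EU": {"sum": 10.0, "count": 1, "min": 10.0, "max": 10.0},
    "US": {"sum": 8.0, "count": 2, "min": 3.0, "max": 5.0},
}
result = merge_partials([worker_1, worker_2])
```

Averaging the two EU averages directly (10.0 and 10.0 here) would only be correct by coincidence; carrying SUM and COUNT keeps the merge exact regardless of how rows are split across segments.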
+----------------+
|  Coordinator   |
+----------------+
        |
        | HTTP /query
        v
+----------------+
|  Query Parser  |
+----------------+
        |
        v
+----------------+
|    Planner     |
+----------------+
        |
        v
+----------------+        +----------------+
|   Dispatcher   |------->|    Worker 1    |
+----------------+        +----------------+
        |                         | Segments + Task
        |                         v
        |                   Executes Task
        |                         |
        |                         v
        |                  Returns Result
        |
        v
+----------------+
|   Aggregator   |
+----------------+
        |
        v
Query Result (HTTP Response)
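The dispatcher step in the diagram, together with the redistribution of a failed worker's segments, might look roughly like the sketch below. It assumes a simple round-robin policy; the function names `assign_segments` and `redistribute`, the segment identifiers, and the worker labels are all hypothetical, and the real planner may use a different strategy.

```python
from typing import Dict, List


def assign_segments(segments: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Assign segments to workers in round-robin order."""
    if not workers:
        raise ValueError("no workers available")
    plan: Dict[str, List[str]] = {w: [] for w in workers}
    for i, seg in enumerate(segments):
        plan[workers[i % len(workers)]].append(seg)
    return plan


def redistribute(plan: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new plan with the failed worker's segments moved to survivors."""
    orphaned = plan.get(failed, [])
    survivors = [w for w in plan if w != failed]
    if not survivors:
        raise RuntimeError("all workers have failed")
    new_plan = {w: list(plan[w]) for w in survivors}  # copy, keep input intact
    for i, seg in enumerate(orphaned):
        new_plan[survivors[i % len(survivors)]].append(seg)
    return new_plan


# Four segments spread over two workers, then the second worker fails.
plan = assign_segments(["seg0", "seg1", "seg2", "seg3"], ["w9001", "w9002"])
recovered = redistribute(plan, failed="w9002")
```

Because `redistribute` returns a new plan instead of mutating the old one, the coordinator in this sketch could retry only the orphaned segments and still reuse results already returned by healthy workers.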
- Clone the repository:
git clone https://github.com/Kallistina/distributed-SQL-query-engine.git
cd distributed-SQL-query-engine/minidist
- Initialize a new table with a schema:
minidist init data/<table_name> --schema <schema_file>
- Load CSV data into the table with sorting and segmentation:
minidist load data/<table_name> --csv <csv_file> --sort-key <column> --segments <n>
- Show table schema and metadata:
minidist show schema data/<table_name>
minidist show info data/<table_name>
- Start workers:
python3 -m worker --port 9001 --data_dir data/sales
python3 -m worker --port 9002 --data_dir data/sales
- Start the coordinator:
python3 -m coordinator --workers 9001 9002
- Run a query (set QUERY to the SQL text first):
curl -X POST localhost:8080/query -d "query=$QUERY"
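The same request can be issued from Python using only the standard library. This is a sketch that assumes the coordinator accepts a form-encoded `query` field on `/query`, as the curl command suggests; `build_query_request` and `run_query` are hypothetical helper names, and the SQL in the usage comment uses a made-up column on the `sales` table.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint, taken from the curl example above.
DEFAULT_URL = "http://localhost:8080/query"


def build_query_request(sql: str, url: str = DEFAULT_URL) -> Request:
    """Build a POST request carrying the SQL text as a form field."""
    body = urlencode({"query": sql}).encode("utf-8")
    return Request(url, data=body, method="POST")


def run_query(sql: str, url: str = DEFAULT_URL, timeout: float = 10.0) -> str:
    """Send the query to the coordinator and return the raw response body."""
    with urlopen(build_query_request(sql, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (requires a running coordinator; the column name is hypothetical):
#   print(run_query("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Unlike `curl -d`, `urlencode` percent-encodes characters such as `*`, `(`, and `=`, which avoids ambiguity when the SQL text contains characters that are significant in form encoding.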