Commit 229625f

Author: Ghislain Fourny
Merge branch 'main' of github.com:RumbleDB/python-jsoniq
2 parents c545db7 + f31c7d9 commit 229625f

1 file changed

Lines changed: 29 additions & 7 deletions

README.md

@@ -2,14 +2,37 @@
22

33
by Abishek Ramdas and Ghislain Fourny
44

5-
This is the Python version of RumbleDB. It is currently only a prototype (alpha) and probably unstable.
5+
This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL, as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.
6+
7+
The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable.
8+
9+
## High-level information
10+
11+
A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.
12+
13+
JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
14+
15+
Any number of Python DataFrames can be attached to JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple built-in function calls (see [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/)).
16+
17+
The resulting sequence of items can be retrieved as a DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
18+
19+
The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
20+
21+
Alternatively, it is possible to directly get an RDD of Python-friendly JSON values, or a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
22+
23+
The design goal is that it should be possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
24+
25+
Any feedback or error reports are very welcome.
26+
27+
## Installation
628

729
Install with
830
```
931
pip install jsoniq
1032
```
1133

12-
Sample code:
34+
## Sample code
35+
1336
```
1437
from jsoniq import RumbleSession
1538
@@ -69,9 +92,8 @@ while(res.hasNext()):
6992
    print(res.nextJSON());
7093
res.close();
7194
72-
# There is still a problem to solve to make RDDs work across Python and Java
73-
#rdd = res.getAsJSONRDD();
74-
#print(rdd.count());
75-
#for str in rdd.take(10):
76-
# print(str);
95+
# This gets an RDD of JSON values that can be processed by Python
rdd = res.getAsJSONRDD();
96+
print(rdd.count());
97+
for str in rdd.take(10):
98+
    print(str);
7799
```
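
The streaming loop in the sample above (`hasNext()`, `nextJSON()`, `close()`) can be wrapped in a Python generator so that the result is closed even if the consumer stops early or an error occurs. The following is only a sketch under stated assumptions: `iter_json` and `FakeResult` are hypothetical names that are not part of the jsoniq package, `FakeResult` merely mimics the three method names visible in the sample, and any setup the real result object needs before iteration is left to the caller.

```python
from typing import Any, Iterator


def iter_json(res: Any) -> Iterator[Any]:
    """Yield whatever nextJSON() returns until hasNext() is false, then close.

    Hypothetical helper: it relies only on the three method names that
    appear in the sample code (hasNext, nextJSON, close).
    """
    try:
        while res.hasNext():
            yield res.nextJSON()
    finally:
        # Runs on exhaustion, on error, and when the generator is closed early.
        res.close()


class FakeResult:
    """Stand-in used for illustration only; NOT the real jsoniq result class."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0
        self.closed = False

    def hasNext(self):
        return self._pos < len(self._values)

    def nextJSON(self):
        value = self._values[self._pos]
        self._pos += 1
        return value

    def close(self):
        self.closed = True


if __name__ == "__main__":
    fake = FakeResult(['{"foo": 1}', '{"foo": 2}'])
    for value in iter_json(fake):
        print(value)
```

With a real result object, the same helper would be used as `for value in iter_json(res): ...`, assuming that object exposes the three methods exactly as in the sample.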
