Commit 74038fd

Update README.md
1 parent d0adcf5 commit 74038fd

1 file changed

Lines changed: 25 additions & 7 deletions

README.md

@@ -2,14 +2,33 @@
 
 by Abishek Ramdas and Ghislain Fourny
 
-This is the Python version of RumbleDB. It is currently only a prototype (alpha) and probably unstable.
+This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.
+
+The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable.
+
+## High-level information
+
+A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.
+
+JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
+
+Any number of Python DataFrames can be attached to JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple built-in function calls (see [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/)).
+
+The resulting sequence of items can be retrieved as a DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
+
+The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
+
+Alternatively, it is possible to directly get an RDD of Python-friendly JSON values, or a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+
+## Installation
 
 Install with
 ```
 pip install jsoniq
 ```
 
-Sample code:
+## Sample code
+
 ```
 from jsoniq import RumbleSession
 
@@ -69,9 +88,8 @@ while(res.hasNext()):
     print(res.nextJSON());
 res.close();
 
-# There is still a problem to solve to make RDDs work across Python and Java
-#rdd = res.getAsJSONRDD();
-#print(rdd.count());
-#for str in rdd.take(10):
-#    print(str);
+rdd = res.getAsJSONRDD();  # This gets an RDD of JSON values that can be processed by Python
+print(rdd.count());
+for str in rdd.take(10):
+    print(str);
 ```
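
The streaming loop in the sample code (`hasNext()`, `nextJSON()`, `close()`) could be wrapped in a Python generator so that the result is closed even when the consumer stops early or an exception is raised. The following is a minimal sketch and not part of the jsoniq package: `iter_json` is a hypothetical helper, and `FakeResult` is a stand-in that only mimics those three method names so the snippet runs without Spark. It assumes the result object is already ready for iteration.

```python
from typing import Any, Iterator


def iter_json(res: Any) -> Iterator[Any]:
    """Yield JSON values from a result object, closing it when done.

    `res` is assumed to expose hasNext(), nextJSON() and close(),
    the three methods used in the README's streaming loop.
    """
    try:
        while res.hasNext():
            yield res.nextJSON()
    finally:
        # Runs on exhaustion, on error, and when the generator is closed early.
        res.close()


class FakeResult:
    """Hypothetical stand-in for a query result (not the real jsoniq class)."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0
        self.closed = False

    def hasNext(self):
        return self._pos < len(self._values)

    def nextJSON(self):
        value = self._values[self._pos]
        self._pos += 1
        return value

    def close(self):
        self.closed = True


if __name__ == "__main__":
    fake = FakeResult([{"foo": 1}, {"foo": 2}])
    for value in iter_json(fake):
        print(value)
    print(fake.closed)
```

With a real query result, the explicit `while` loop and the trailing `res.close()` would be replaced by `for value in iter_json(res): ...`.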
