by Abishek Ramdas and Ghislain Fourny

This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL, as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.

The Python edition of RumbleDB is currently only a prototype (alpha) and is probably unstable.

## High-level information

A RumbleSession is a wrapper around a SparkSession that additionally ensures that the RumbleDB environment is in scope.

JSONiq queries are invoked with `rumble.jsoniq()`, in the same way that Spark SQL queries are invoked with `spark.sql()`.

Any number of Python DataFrames can be attached to JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to `spark.sql()`. Alternatively, the JSONiq query can directly read many files in many different formats from many places (local drive, HTTP, S3, HDFS, ...) with simple builtin function calls (see [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/)).

The resulting sequence of items can be retrieved as a DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.

The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).

Alternatively, it is possible to directly get an RDD of Python-friendly JSON values, a Python list of JSON values, or a streaming iteration of JSON values. This convenience makes it unnecessary to use the Item API, which is especially helpful for first-time users.

The design goal is that DataFrames can be chained seamlessly between JSONiq and Spark SQL queries. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.

Any feedback or error reports are very welcome.

## Installation

Install with:
```
pip install jsoniq
```


## Sample code

```
from jsoniq import RumbleSession

# ...

while(res.hasNext()):
    print(res.nextJSON());
res.close();

# This gets an RDD of JSON values that can be processed by Python.
rdd = res.getAsJSONRDD();
print(rdd.count());
for s in rdd.take(10):
    print(s);
```