This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Python.
JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to Petabytes, with very little coding effort.
Spark aficionados can also pass DataFrames to JSONiq queries and take back DataFrames. This gives them an environment in which both Spark SQL and JSONiq co-exist to manipulate the data.
The Python edition of RumbleDB is currently a prototype (alpha) and probably unstable. We welcome bug reports and feedback.
## About RumbleDB
JSONiq queries are invoked with rumble.jsoniq(), similarly to how Spark SQL queries are invoked with spark.sql().
JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to Pyspark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join between different collections).
It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple built-in function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
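For example, a query can read a JSON Lines file directly; the S3 path below is a placeholder, not a real dataset:

```python
# A JSONiq query that reads one JSON object per line with json-lines(),
# one of the documented input functions. The path is a placeholder.
query = """
for $log in json-lines("s3://my-bucket/logs.jsonl")
where $log.status eq 500
return $log.url
"""
# With a RumbleSession named rumble, this would run as: rumble.jsoniq(query)
```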
The resulting sequence of items can be retrieved as a list of JSON values, as a Pyspark DataFrame, or, for advanced users, as an RDD or with a streaming iteration over the items using the [RumbleDB Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc., in a way similar to how DataFrames are written back by Pyspark.
The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
Any feedback or error reports are very welcome.
## Type mapping
When passing Python values to JSONiq or getting them back from a JSONiq query, the mapping is as follows:
| Python | JSONiq  |
|--------|---------|
| dict   | object  |
| list   | array   |
| str    | string  |
| int    | integer |
| bool   | boolean |
| None   | null    |
Furthermore, other JSONiq types will be mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.
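For instance, a single Python value can exercise every row of the table (the round-trip comment assumes the bindOne() call from the release notes below):

```python
# Each field crosses into JSONiq as the type noted on the right.
value = {                 # dict -> object
    "title": "demo",      # str  -> string
    "count": 3,           # int  -> integer
    "valid": True,        # bool -> boolean
    "missing": None,      # None -> null
    "tags": ["a", "b"],   # list -> array
}
# With a RumbleSession named rumble, rumble.bindOne("v", value) passes it
# to JSONiq, and a query returning $v yields the same Python dict back.
```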
## Installation
Install with
```
pip install jsoniq
```
## Sample code
We will make more documentation available as we go. In the meantime, you will find commented sample code below that should run as-is after installing the library.
54
71
55
72
You can directly copy-paste the code below into a Python file and execute it with Python.
```python
from jsoniq import RumbleSession

# The syntax to start a session is similar to that of Spark.
# A RumbleSession is a SparkSession that additionally knows about RumbleDB.
# All attributes and methods of SparkSession are also available on RumbleSession.
```
Even more queries can be found [here](https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb) and you can look at the [JSONiq documentation](https://www.jsoniq.org) and tutorials.
# Last updates
## Version 0.1.0 alpha 12
- Allow binding JSONiq variables to Python values (mapping Python lists to sequences of items). This makes it possible to manipulate Python values directly with JSONiq, even without any knowledge of Spark.
- Renamed bindDataFrameAsVariable() to bind(), which can be used with both DataFrames and Python lists.
- Added bindOne() for binding a single value to a JSONiq variable.
- The output of df() is now wrapped in a Pyspark DataFrame to make sure it can be used with Pyspark DataFrame transformations.
## Version 0.1.0 alpha 11
- Fixed an issue when feeding a DataFrame output by rumble.jsoniq() back into a new JSONiq query (as a variable).