Commit a9a357e

Author: Ghislain Fourny

Improve and add documentation.

1 parent: d31e207

4 files changed
Lines changed: 59 additions & 25 deletions

File tree

README.md

Lines changed: 55 additions & 21 deletions
@@ -4,21 +4,35 @@ by Abishek Ramdas and Ghislain Fourny
 
 This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL, as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.
 
-The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable.
+The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable. We welcome bug reports.
 
-## High-level information
+## About RumbleDB
+
+RumbleDB is a JSONiq engine that works both with very small and very large amounts of data.
+It works with JSON, CSV, text, Parquet, etc. (and soon XML).
+It works on your laptop as well as on any Spark cluster (AWS, company clusters, etc.).
+
+It automatically detects and switches between execution modes in a way that is transparent to the user, bringing the convenience of data independence to the world of messy data.
+
+It is an academic project, implemented natively in Java, carried out at ETH Zurich by many students over more than 8 years: Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noé, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Matteo Agnoletto, Dwij Dixit.
+
+It is free and open source under the Apache 2.0 license, which also permits commercial use (on an as-is basis, with no guarantee).
+
+## High-level information on the library
 
 A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.
 
 JSONiq queries are invoked with rumble.jsoniq() in a way similar to how Spark SQL queries are invoked with spark.sql().
 
-Any number of Python DataFrames can be attached to JSONiq variables used in the query. It will later also possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls (see [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/)).
+Any number of Python DataFrames can be attached to external JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files in many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
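The builtin input functions named above (json-lines(), text-file(), parquet-file(), csv-file()) take a path or URL directly inside the query. As a small, hedged sketch (the file path, field names, and query text below are illustrative assumptions; only the query string is constructed here, since actually running it requires the jsoniq package and a Spark environment):

```python
# Sketch: composing a JSONiq query that reads a JSON Lines file directly.
# The builtin function json-lines() is named in the README; the file path
# and field names below are hypothetical examples.
import json
import os
import tempfile

# Create a tiny JSON Lines file to illustrate the input format json-lines() expects.
path = os.path.join(tempfile.mkdtemp(), "people.jsonl")
with open(path, "w") as f:
    for row in [{"Name": "Alice", "Age": 30}, {"Name": "Bob", "Age": 25}]:
        f.write(json.dumps(row) + "\n")

# The corresponding JSONiq query; with the library installed, one would pass
# this string to rumble.jsoniq().
query = f'for $o in json-lines("{path}") return $o.Name'
print(query)
```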
 
 The resulting sequence of items can be retrieved as a DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
 
 The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
 
-Alternatively, it is possible to directly get an RDD of Python-friendly JSON values, or a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+Alternatively, it is possible to directly get a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+
+It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc., in a way similar to how DataFrames are written back by Spark.
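Conceptually, the JSON-value convenience maps each item's JSON serialization to a native Python value (dict, list, int, str, ...). A stdlib-only sketch of that mapping, with made-up serialized items standing in for real query output:

```python
import json

# Hypothetical serialized items, standing in for what a JSONiq query might
# produce; with the real library, res.json() performs this deserialization.
serialized_items = ['{"Name": "Alice"}', '[1, 2, 3]', '42', '"hello"']

# Each JSON text becomes a native Python value: dict, list, int, str.
values = [json.loads(s) for s in serialized_items]
print(values)
```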
 
 The design goal is that it should be possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
 
@@ -33,10 +47,15 @@ pip install jsoniq
 
 ## Sample code
 
+We will make more documentation available as we go. In the meantime, you will find sample code below that should run as is
+after installing the library.
+
 ```
 from jsoniq import RumbleSession
 
-# The syntax to start a session is similar to Spark.
+# The syntax to start a session is similar to that of Spark.
+# A RumbleSession is a SparkSession that additionally knows about RumbleDB.
+# All attributes and methods of SparkSession are also available on RumbleSession.
 rumble = RumbleSession.builder.appName("PyRumbleExample").getOrCreate();
 
 # Create a data frame also similar to Spark (but using the rumble object).
@@ -47,29 +66,32 @@ df = rumble.createDataFrame(data, columns);
 # This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want.
 rumble.bindDataFrameAsVariable('$a', df);
 
-# This is how to run a query (declaring the external variable). This is similar to spark.sql().
-res = rumble.jsoniq('declare variable $a external; $a.Name');
-
-# returns a list containing one or several of "DataFrame", "RDD", "PUL", "Local"
+# This is how to run a query. This is similar to spark.sql().
+# Since variable $a was bound to a DataFrame, it is automatically declared as an external variable
+# and can be used in the query. In JSONiq, it is logically a sequence of objects.
+res = rumble.jsoniq('$a.Name');
+
+# There are several ways to collect the outputs, depending on the user's needs but also
+# on the query supplied.
+# This returns a list containing one or several of "DataFrame", "RDD", "PUL", "Local".
+# If DataFrame is in the list, df() can be invoked.
+# If RDD is in the list, rdd() can be invoked.
+# If Local is in the list, items() or json() can be invoked, as well as the local iterator API.
 modes = res.availableOutputs();
+for mode in modes:
+    print(mode)
 
 ###### Parallel access ######
 
-# This returns a regular data frame
-df = res.getAsDataFrame();
+# This returns a regular data frame that can be further processed with spark.sql() or rumble.jsoniq().
+df = res.df();
 df.show();
 
-# This returns an RDD containing JSONiq item objects (does not work yet with transformations)
-rdd = res.getAsRDD();
-print(rdd.count());
-for item in rdd.take(10):
-    print(item.getStringValue());
-
 ##### Local access ######
 
 # This materializes the rows as items.
-# The items are access with the RumbleDB Item API.
-list = res.getAsList();
+# The items are accessed with the RumbleDB Item API.
+list = res.items();
 for result in list:
     print(result.getStringValue())
 
@@ -82,7 +104,7 @@ res.close();
 ###### Native Python/JSON Access for bypassing the Item API (but losing the richer JSONiq type system) ######
 
 # This method directly gets the result as JSON (dict, list, strings, ints, etc.).
-jlist = res.getAsJSONList();
+jlist = res.json();
 for str in jlist:
     print(str);
 
@@ -93,8 +115,20 @@ while(res.hasNext()):
 res.close();
 
 # This gets an RDD of JSON values that can be processed by Python.
-rdd = res.getAsJSONRDD();
+rdd = res.rdd();
 print(rdd.count());
 for str in rdd.take(10):
     print(str);
+
+# It is also possible to write the output to a file, locally or on a cluster. The API is similar to that of Spark dataframes.
+# Note that this creates a directory and stores the (potentially very large) output in that directory.
+# RumbleDB has already been tested with up to 64 AWS machines and 100s of TBs of data.
+# Of course, the examples below are so small that it makes more sense to process the results locally with Python,
+# but this shows how GBs or TBs of data obtained from JSONiq can be written back to disk.
+seq = rumble.jsoniq("$a.Name");
+seq.write().mode("overwrite").json("outputjson");
+seq.write().mode("overwrite").parquet("outputparquet");
+
+seq = rumble.jsoniq("1+1");
+seq.write().mode("overwrite").text("outputtext");
 ```

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "jsoniq"
-version = "0.1.0a7"
+version = "0.1.0a8"
 description = "Python edition of RumbleDB, a JSONiq engine"
 requires-python = ">=3.11"
 dependencies = [
Binary file changed (3.94 KB, not shown)

src/jsoniq/sequence.py

Lines changed: 3 additions & 3 deletions
@@ -7,10 +7,10 @@ def __init__(self, sequence, sparkcontext):
         self._jsequence = sequence
         self._sparkcontext = sparkcontext
 
-    def getAsJSONList(self):
-        return [json.loads(l.serializeAsJSON()) for l in self._jsequence.getAsList()]
+    def json(self):
+        return [json.loads(l.serializeAsJSON()) for l in self._jsequence.items()]
 
-    def getAsJSONRDD(self):
+    def rdd(self):
         rdd = self._jsequence.getAsPickledStringRDD();
         rdd = RDD(rdd, self._sparkcontext)
         return rdd.map(lambda l: json.loads(l))
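The renamed json() helper above amounts to deserializing each item's JSON serialization into native Python values. A stdlib-only sketch of that logic, using a hypothetical FakeItem class that mimics serializeAsJSON() from the Item API:

```python
import json

class FakeItem:
    """Stand-in for a RumbleDB item exposing serializeAsJSON()."""
    def __init__(self, payload):
        self._payload = payload

    def serializeAsJSON(self):
        return json.dumps(self._payload)

def items_to_json(items):
    # Mirrors the json() method: deserialize each item's JSON serialization.
    return [json.loads(i.serializeAsJSON()) for i in items]

result = items_to_json([FakeItem({"Name": "Alice"}), FakeItem(2)])
print(result)
```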
