README.md (55 additions & 21 deletions)
@@ -4,21 +4,35 @@ by Abishek Ramdas and Ghislain Fourny
 
 This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.
 
-The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable.
+The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable. We welcome bug reports.
 
-## High-level information
+## About RumbleDB
+
+RumbleDB is a JSONiq engine that works both with very small amounts of data and very large amounts of data.
+It works with JSON, CSV, text, Parquet, etc. (and soon XML).
+It works on your laptop as well as on any Spark cluster (AWS, company clusters, etc.).
+
+It automatically detects and switches between execution modes in a way transparent to the user, bringing the convenience of data independence to the world of messy data.
+
+It is an academic project, written natively in Java, carried out at ETH Zurich by many students over more than 8 years: Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noé, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Matteo Agnoletto, Dwij Dixit.
+
+It is free and open source, under an Apache 2.0 license, and can also be used commercially (but on an as-is basis with no guarantee).
+
+## High-level information on the library
 
 A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.
 
 JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
 
-Any number of Python DataFrames can be attached to JSONiq variables used in the query. It will later also possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls (see [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/)).
+Any number of Python DataFrames can be attached to external JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple built-in function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
 
 The resulting sequence of items can be retrieved as a DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
 
 The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
 
-Alternatively, it is possible to directly get an RDD of Python-friendly JSON values, or a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+Alternatively, it is possible to directly get a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+
+It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc., in a way similar to how DataFrames are written back by Spark.
 
 The design goal is that it should be possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
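
As a rough illustration of the workflow this hunk describes (invoke a query with `rumble.jsoniq()`, then fetch the result as a Python list), the sketch below wraps the call in a small helper. Only `RumbleSession` and `jsoniq()` are named in the README text; the `json()` accessor, the `builder.getOrCreate()` call, and the helper names `run_query` and `demo` are assumptions made for illustration and should be checked against the library before use.

```python
# A sample JSONiq query: builds three small JSON objects.
QUERY = 'for $i in 1 to 3 return { "n": $i, "square": $i * $i }'


def run_query(session, query):
    """Run a JSONiq query and return the resulting items as a Python list.

    Hypothetical helper, not part of the jsoniq library. `session` is
    expected to be a RumbleSession. `jsoniq()` is the method named in the
    README; `json()` is an ASSUMED accessor on the returned sequence.
    """
    result = session.jsoniq(query)
    return list(result.json())


def demo():
    """Run the sample query against a real session.

    Needs the jsoniq package and a working Java/Spark installation.
    """
    from jsoniq import RumbleSession

    # ASSUMPTION: the builder mirrors SparkSession.builder, as the README
    # says session creation is similar to Spark.
    rumble = RumbleSession.builder.getOrCreate()
    return run_query(rumble, QUERY)
```

Keeping the session as a parameter means the helper can be exercised with a stand-in object when Spark is not available; `demo()` is only called when a real installation is present.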
@@ -33,10 +47,15 @@ pip install jsoniq
 
 ## Sample code
 
+We will make more documentation available as we go. In the meantime, you will find sample code below that should just run
+after installing the library.
+
 ```
 from jsoniq import RumbleSession
 
-# The syntax to start a session is similar to Spark.
+# The syntax to start a session is similar to that of Spark.
+# A RumbleSession is a SparkSession that additionally knows about RumbleDB.
+# All attributes and methods of SparkSession are also available on RumbleSession.