Commit 4f634ef

Update README.md
1 parent b4b4224 commit 4f634ef

1 file changed

Lines changed: 34 additions & 14 deletions

File tree

README.md

@@ -2,9 +2,13 @@
 
 by Abishek Ramdas and Ghislain Fourny
 
-This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to Petabytes, with very little coding effort.
+This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Python.
 
-The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable. We welcome bug reports.
+JSONiq is a language considerably more powerful than SQL, as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to petabytes, with very little coding effort.
+
+Spark aficionados can also pass DataFrames to JSONiq queries and get DataFrames back. This gives them an environment in which both Spark SQL and JSONiq co-exist to manipulate the data.
+
+The Python edition of RumbleDB is currently a prototype (alpha) and probably unstable. We welcome bug reports and feedback.
 
 ## About RumbleDB
 

@@ -24,20 +28,33 @@ A RumbleSession is a wrapper around a SparkSession that additionally makes sure
 
 JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
 
-Any number of Python DataFrames can be attached to external JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
-
-The resulting sequence of items can be retrieved as DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
+JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to PySpark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join different collections).
 
-The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
+It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files in many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly, with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
 
-Alternatively, it is possible to directly get a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+The resulting sequence of items can be retrieved as a list of JSON values, as a PySpark DataFrame, or, for advanced users, as an RDD or with a streaming iteration over the items using the [RumbleDB Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
 
-It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc in a way similar to how DataFrames are written back by Spark.
+It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc. in a way similar to how DataFrames are written back by PySpark.
 
-The design goal is that it should be possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
+The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
 
 Any feedback or error reports are very welcome.
 
+## Type mapping
+
+When passing Python values to JSONiq or getting them back from a JSONiq query, the mapping is as follows:
+
+| Python | JSONiq  |
+|--------|---------|
+| dict   | object  |
+| list   | array   |
+| str    | string  |
+| int    | integer |
+| bool   | boolean |
+| None   | null    |
+
+Furthermore, other JSONiq types will be mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.
+
 ## Installation
 
 Install with
@@ -49,7 +66,7 @@ pip install jsoniq
 
 ## Sample code
 
-We will make more documentation available as we go. In the meantime, you will find a sample code below that should just run
+We will make more documentation available as we go. In the meantime, you will find commented sample code below that should just run
 after installing the library.
 
 You can directly copy paste the code below to a Python file and execute it with Python.
@@ -60,9 +77,11 @@ from jsoniq import RumbleSession
 # The syntax to start a session is similar to that of Spark.
 # A RumbleSession is a SparkSession that additionally knows about RumbleDB.
 # All attributes and methods of SparkSession are also available on RumbleSession.
+
 rumble = RumbleSession.builder.getOrCreate();
 
 # Just to improve readability when invoking Spark methods
+# (such as spark.sql() or spark.createDataFrame()).
 spark = rumble
 
 ##############################
@@ -75,10 +94,11 @@ spark = rumble
 # of items, here the sequence with just the integer item 2.
 items = rumble.jsoniq('1+1')
 
-# A sequence of items can simply be converted to a list of Python values with json().
-# Since there is only one value in the sequence output by this query, we get a singleton list with the integer 2.
+# A sequence of items can simply be converted to a list of Python/JSON values with json().
+# Since there is only one value in the sequence output by this query,
+# we get a singleton list with the integer 2.
+# Generally though, the results may contain zero, one, two, or more items.
 python_list = items.json()
-
 print(python_list)
 
 ############################################
@@ -141,7 +161,7 @@ print(seq.json());
 # queries are sequence of items.
 # A Python list will be seamlessly converted to a sequence of items by the library.
 # Currently we only support strs, ints, floats, booleans, None, lists, and dicts.
-# But if you need more (like date, bytes, etc) we can add them without any problem.
+# But if you need more (like date, bytes, etc.) we will add them without any problem.
 # JSONiq has a rich type system.
 
 rumble.bind('$c', [1,2,3,4, 5, 6])
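The comment above lists the Python types that can currently be bound to a JSONiq variable. As a minimal plain-Python sketch of that restriction (`is_bindable` is a hypothetical helper name for illustration only, not part of the jsoniq library):

```python
def is_bindable(value):
    """Check that a value uses only the types the README says rumble.bind()
    currently accepts: str, int, float, bool, None, list, dict."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return True
    if isinstance(value, list):
        return all(is_bindable(v) for v in value)
    if isinstance(value, dict):
        # JSON object keys must be strings.
        return all(isinstance(k, str) and is_bindable(v)
                   for k, v in value.items())
    return False

print(is_bindable([1, 2, 3, 4, 5, 6]))          # True: like the list bound to $c
print(is_bindable({"a": [1.5, None, True]}))    # True: nested JSON values
print(is_bindable(b"raw bytes"))                # False: bytes is not yet supported
```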

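The table in the new "Type mapping" section follows the standard JSON correspondence, which can be illustrated with nothing but Python's built-in json module (a sketch of the analogy, not of the jsoniq library itself):

```python
import json

# Each Python value serializes to the JSON construct named by the
# JSONiq type in the table: dict -> object, list -> array, str -> string,
# int -> integer, bool -> boolean, None -> null.
examples = {
    "object":  {"name": "RumbleDB"},
    "array":   [1, 2, 3],
    "string":  "hello",
    "integer": 42,
    "boolean": True,
    "null":    None,
}

for jsoniq_type, value in examples.items():
    print(jsoniq_type, "<-", json.dumps(value))
```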