Commit a4bb9df

Merge pull request #6 from RumbleDB/BindPythonValues

Bind python values

2 parents 289751e + 000ff50, commit a4bb9df

5 files changed: 225 additions & 69 deletions

README.md

Lines changed: 170 additions & 65 deletions
@@ -2,9 +2,13 @@
 
 by Abishek Ramdas and Ghislain Fourny
 
-This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Spark and DataFrames. JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to Petabytes, with very little coding effort.
+This is the Python edition of [RumbleDB](https://rumbledb.org/), which brings [JSONiq](https://www.jsoniq.org) to the world of Python.
 
-The Python edition of RumbleDB is currently only a prototype (alpha) and probably unstable. We welcome bug reports.
+JSONiq is a language considerably more powerful than SQL as it can process [messy, heterogeneous datasets](https://arxiv.org/abs/1910.11582), from kilobytes to Petabytes, with very little coding effort.
+
+Spark aficionados can also pass DataFrames to JSONiq queries and take back DataFrames. This gives them an environment in which both Spark SQL and JSONiq co-exist to manipulate the data.
+
+The Python edition of RumbleDB is currently a prototype (alpha) and probably unstable. We welcome bug reports and feedback.
 
 ## About RumbleDB
 
@@ -24,20 +28,33 @@ A RumbleSession is a wrapper around a SparkSession that additionally makes sure
 
 JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
 
-Any number of Python DataFrames can be attached to external JSONiq variables used in the query. It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
+JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to PySpark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join between different collections).
 
-The resulting sequence of items can be retrieved as DataFrame, as an RDD, as a Python list, or with a streaming iteration over the items.
+It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc. See [RumbleDB's documentation](https://rumble.readthedocs.io/en/latest/).
 
-The individual items can be processed using the RumbleDB [Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
+The resulting sequence of items can be retrieved as a list of JSON values, as a PySpark DataFrame, or, for advanced users, as an RDD or with a streaming iteration over the items using the [RumbleDB Item API](https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/api/Item.java).
 
-Alternatively, it is possible to directly get a Python list of JSON values, or a streaming iteration of JSON values. This is a convenience that makes it unnecessary to use the Item API, especially for a first-time user.
+It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc. in a way similar to how DataFrames are written back by PySpark.
 
-It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc in a way similar to how DataFrames are written back by Spark.
-
-The design goal is that it should be possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
+The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
 
 Any feedback or error reports are very welcome.
 
+## Type mapping
+
+When passing Python values to JSONiq or getting them back from a JSONiq query, the mapping is as follows:
+
+| Python | JSONiq |
+|--------|--------|
+| dict | object |
+| list | array |
+| str | string |
+| int | integer |
+| bool | boolean |
+| None | null |
+
+Furthermore, other JSONiq types are mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.
+
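The mapping table above can be mirrored by a small, purely illustrative lookup. The helper name `jsoniq_type_of` is hypothetical and not part of the jsoniq library, which performs this conversion internally:

```python
# Purely illustrative helper mirroring the README's type-mapping table.
# jsoniq_type_of is a hypothetical name, not part of the jsoniq library's API.
def jsoniq_type_of(value):
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if value is None:
        return "null"
    return {dict: "object", list: "array", str: "string", int: "integer"}[type(value)]

print(jsoniq_type_of({"a": 1}))  # object
print(jsoniq_type_of([1, 2]))    # array
print(jsoniq_type_of("x"))       # string
print(jsoniq_type_of(3))         # integer
print(jsoniq_type_of(True))      # boolean
print(jsoniq_type_of(None))      # null
```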
 ## Installation
 
 Install with
@@ -49,7 +66,7 @@ pip install jsoniq
 
 ## Sample code
 
-We will make more documentation available as we go. In the meantime, you will find sample code below that should just run
+We will make more documentation available as we go. In the meantime, you will find commented sample code below that should just run
 after installing the library.
 
 You can directly copy-paste the code below into a Python file and execute it with Python.
@@ -60,15 +77,145 @@ from jsoniq import RumbleSession
 # The syntax to start a session is similar to that of Spark.
 # A RumbleSession is a SparkSession that additionally knows about RumbleDB.
 # All attributes and methods of SparkSession are also available on RumbleSession.
-rumble = RumbleSession.builder.appName("PyRumbleExample").getOrCreate();
+
+rumble = RumbleSession.builder.getOrCreate();
+
+# Just to improve readability when invoking Spark methods
+# (such as spark.sql() or spark.createDataFrame()).
+spark = rumble
+
+##############################
+###### Your first query ######
+##############################
+
+# Even though RumbleDB uses Spark internally, it can be used without any knowledge of Spark.
+
+# Executing a query is done with rumble.jsoniq() like so. A query returns a sequence
+# of items, here the sequence with just the integer item 2.
+items = rumble.jsoniq('1+1')
+
+# A sequence of items can simply be converted to a list of Python/JSON values with json().
+# Since there is only one value in the sequence output by this query,
+# we get a singleton list with the integer 2.
+# Generally though, the results may contain zero, one, two, or more items.
+python_list = items.json()
+print(python_list)
+
+############################################
+##### More complex, standalone queries #####
+############################################
+
+# JSONiq is very powerful and expressive. You will find tutorials as well as a reference on JSONiq.org.
+
+seq = rumble.jsoniq("""
+
+let $stores :=
+  [
+    { "store number" : 1, "state" : "MA" },
+    { "store number" : 2, "state" : "MA" },
+    { "store number" : 3, "state" : "CA" },
+    { "store number" : 4, "state" : "CA" }
+  ]
+let $sales := [
+  { "product" : "broiler", "store number" : 1, "quantity" : 20 },
+  { "product" : "toaster", "store number" : 2, "quantity" : 100 },
+  { "product" : "toaster", "store number" : 2, "quantity" : 50 },
+  { "product" : "toaster", "store number" : 3, "quantity" : 50 },
+  { "product" : "blender", "store number" : 3, "quantity" : 100 },
+  { "product" : "blender", "store number" : 3, "quantity" : 150 },
+  { "product" : "socks", "store number" : 1, "quantity" : 500 },
+  { "product" : "socks", "store number" : 2, "quantity" : 10 },
+  { "product" : "shirt", "store number" : 3, "quantity" : 10 }
+]
+let $join :=
+  for $store in $stores[], $sale in $sales[]
+  where $store."store number" = $sale."store number"
+  return {
+    "nb" : $store."store number",
+    "state" : $store.state,
+    "sold" : $sale.product
+  }
+return [$join]
+""");
+
+print(seq.json());
+
+seq = rumble.jsoniq("""
+for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
+group by $store-number := $product.store-number
+order by $store-number ascending
+return {
+  "store" : $store-number,
+  "products" : [ distinct-values($product.product) ]
+}
+""");
+print(seq.json());
+
+############################################################
+###### Binding JSONiq variables to Python values ###########
+############################################################
+
+# It is possible to bind a JSONiq variable to a list of native Python values
+# and then use it in a query.
+# In JSONiq, variables are bound to sequences of items, just like the results of JSONiq
+# queries are sequences of items.
+# A Python list will be seamlessly converted to a sequence of items by the library.
+# Currently we only support strs, ints, floats, booleans, None, lists, and dicts.
+# But if you need more (like date, bytes, etc.) we will add them without any problem.
+# JSONiq has a rich type system.
+
+rumble.bind('$c', [1, 2, 3, 4, 5, 6])
+print(rumble.jsoniq("""
+for $v in $c
+let $parity := $v mod 2
+group by $parity
+return { switch($parity)
+         case 0 return "even"
+         case 1 return "odd"
+         default return "?" : $v
+       }
+""").json())
+
+rumble.bind('$c', [[1, 2, 3], [4, 5, 6]])
+print(rumble.jsoniq("""
+for $i in $c
+return [
+  for $j in $i
+  return { "foo" : $j }
+]
+""").json())
+
+rumble.bind('$c', [{"foo": [1, 2, 3]}, {"foo": [4, {"bar": [1, False, None]}, 6]}])
+print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())
+
+# It is possible to bind only one value. Then it must be provided as a singleton list.
+# This is because in JSONiq, an item is the same as a sequence of one item.
+rumble.bind('$c', [42])
+print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
+
+# For convenience and code readability, you can also use bindOne().
+rumble.bindOne('$c', 42)
+print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
+
+
+################################################
+##### Using PySpark DataFrames with JSONiq #####
+################################################
+
+# Power users can also interface our library with PySpark DataFrames.
+# JSONiq sequences of items can have billions of items, and our library supports this
+# out of the box: it can also run on clusters, for example on AWS Elastic MapReduce.
+# But your laptop is just fine, too: it will spread the computations across your cores.
+# You can bind a DataFrame to a JSONiq variable. JSONiq will recognize this
+# DataFrame as a sequence of object items.
 
 # Create a data frame also similar to Spark (but using the rumble object).
 data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)];
 columns = ["Name", "Age"];
-df = rumble.createDataFrame(data, columns);
+df = spark.createDataFrame(data, columns);
 
 # This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want.
-rumble.bindDataFrameAsVariable('$a', df);
+rumble.bind('$a', df);
 
 # This is how to run a query. This is similar to spark.sql().
 # Since variable $a was bound to a DataFrame, it is automatically declared as an external variable
@@ -101,18 +248,18 @@ df.show();
 
 # A DataFrame output by JSONiq can be reused as input to a Spark SQL query.
 # (Remember that rumble is a wrapper around a SparkSession object, so you can use rumble.sql() just like spark.sql())
-df.createTempView("input")
-df2 = rumble.sql("SELECT * FROM input").toDF("name");
+df.createTempView("myview")
+df2 = spark.sql("SELECT * FROM myview").toDF("name");
 df2.show();
 
 # A DataFrame output by Spark SQL can be reused as input to a JSONiq query.
-rumble.bindDataFrameAsVariable('$b', df2);
+rumble.bind('$b', df2);
 seq2 = rumble.jsoniq("for $i in 1 to 5 return $b");
 df3 = seq2.df();
 df3.show();
 
 # And a DataFrame output by JSONiq can be reused as input to another JSONiq query.
-rumble.bindDataFrameAsVariable('$b', df3);
+rumble.bind('$b', df3);
 seq3 = rumble.jsoniq("$b[position() lt 3]");
 df4 = seq3.df();
 df4.show();
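As a side note, `$b[position() lt 3]` in the last query above uses JSONiq's 1-based positional predicate to keep the first two items of the sequence. In plain Python terms (a sketch unrelated to the jsoniq API), it behaves like a slice:

```python
# JSONiq positions are 1-based, so position() lt 3 keeps positions 1 and 2.
b = [{"Name": "Alice"}, {"Name": "Bob"}, {"Name": "Charlie"}]

first_two = b[:2]  # Python equivalent of the JSONiq $b[position() lt 3]
print(first_two)   # [{'Name': 'Alice'}, {'Name': 'Bob'}]
```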
@@ -170,61 +317,19 @@ seq.write().mode("overwrite").parquet("outputparquet");
 seq = rumble.jsoniq("1+1");
 seq.write().mode("overwrite").text("outputtext");
 
-############################################
-##### More complex, standalone queries #####
-############################################
-
-seq = rumble.jsoniq("""
-
-let $stores :=
-[
-{ "store number" : 1, "state" : "MA" },
-{ "store number" : 2, "state" : "MA" },
-{ "store number" : 3, "state" : "CA" },
-{ "store number" : 4, "state" : "CA" }
-]
-let $sales := [
-{ "product" : "broiler", "store number" : 1, "quantity" : 20 },
-{ "product" : "toaster", "store number" : 2, "quantity" : 100 },
-{ "product" : "toaster", "store number" : 2, "quantity" : 50 },
-{ "product" : "toaster", "store number" : 3, "quantity" : 50 },
-{ "product" : "blender", "store number" : 3, "quantity" : 100 },
-{ "product" : "blender", "store number" : 3, "quantity" : 150 },
-{ "product" : "socks", "store number" : 1, "quantity" : 500 },
-{ "product" : "socks", "store number" : 2, "quantity" : 10 },
-{ "product" : "shirt", "store number" : 3, "quantity" : 10 }
-]
-let $join :=
-for $store in $stores[], $sale in $sales[]
-where $store."store number" = $sale."store number"
-return {
-"nb" : $store."store number",
-"state" : $store.state,
-"sold" : $sale.product
-}
-return [$join]
-""");
-
-print(seq.json());
-
-seq = rumble.jsoniq("""
-for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
-group by $store-number := $product.store-number
-order by $store-number ascending
-return {
-"store" : $store-number,
-"products" : [ distinct-values($product.product) ]
-}
-""");
-print(seq.json());
-
 ```
 # How to learn JSONiq, and more query examples
 
 Even more queries can be found [here](https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb) and you can look at the [JSONiq documentation](https://www.jsoniq.org) and tutorials.
 
 # Last updates
 
+## Version 0.1.0 alpha 12
+- Allow binding JSONiq variables to Python values (mapping Python lists to sequences of items). This makes it possible to manipulate Python values directly with JSONiq, even without any knowledge of Spark at all.
+- Renamed bindDataFrameAsVariable() to bind(), which can be used both with DataFrames and with Python lists.
+- Added bindOne() for binding a single value to a JSONiq variable.
+- Wrapped df() in a PySpark DataFrame to make sure it can be used with PySpark DataFrame transformations.
+
 ## Version 0.1.0 alpha 11
 - Fix an issue when feeding a DataFrame output by rumble.jsoniq() back to a new JSONiq query (as a variable).
pyproject.toml

Lines changed: 3 additions & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "jsoniq"
-version = "0.1.0a11"
+version = "0.1.0a12"
 description = "Python edition of RumbleDB, a JSONiq engine"
 requires-python = ">=3.11"
 dependencies = [
@@ -21,6 +21,8 @@ classifiers = [
     "Development Status :: 3 - Alpha",
     "Intended Audience :: Developers",
     "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
 ]
 
 [tool.setuptools.packages.find]
Binary file (1.48 MB) not shown.

src/jsoniq/sequence.py

Lines changed: 7 additions & 2 deletions

@@ -1,11 +1,13 @@
 from pyspark import RDD
 from pyspark.sql import SparkSession
+from pyspark.sql import DataFrame
 import json
 
 class SequenceOfItems:
-    def __init__(self, sequence, sparkcontext):
+    def __init__(self, sequence, sparksession):
         self._jsequence = sequence
-        self._sparkcontext = sparkcontext
+        self._sparkcontext = sparksession.sparkContext
+        self._sparksession = sparksession
 
     def json(self):
         return [json.loads(l.serializeAsJSON()) for l in self._jsequence.items()]
@@ -15,6 +17,9 @@ def rdd(self):
         rdd = RDD(rdd, self._sparkcontext)
         return rdd.map(lambda l: json.loads(l))
 
+    def df(self):
+        return DataFrame(self._jsequence.getAsDataFrame(), self._sparksession)
+
     def nextJSON(self):
         return self._jsequence.next().serializeAsJSON()
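The json() method in the diff above deserializes each item's JSON serialization. Its core logic can be sketched without Spark by using a stand-in for the Java-side items (FakeItem is invented here purely for illustration; real items come from the RumbleDB Item API and expose serializeAsJSON()):

```python
import json

class FakeItem:
    """Illustrative stand-in for a RumbleDB item exposing serializeAsJSON()."""
    def __init__(self, serialized):
        self._serialized = serialized

    def serializeAsJSON(self):
        return self._serialized

items = [FakeItem('{"foo": 1}'), FakeItem('2'), FakeItem('"hello"')]

# Same list comprehension as SequenceOfItems.json() in the diff above.
values = [json.loads(i.serializeAsJSON()) for i in items]
print(values)  # [{'foo': 1}, 2, 'hello']
```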
