Commit e662597

Support for pandas dataframes, and mapping sequences to Python tuples
Pandas
2 parents a4bb9df + 1f4adc9 commit e662597

5 files changed

Lines changed: 70 additions & 18 deletions

File tree

README.md

Lines changed: 38 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -42,10 +42,16 @@ Any feedback or error reports are very welcome.
4242

4343
## Type mapping
4444

45-
When passing Python values to JSONiq or getting them from a JSONiq queries, the mapping is as follows:
45+
Any expression in JSONiq returns a sequence of items. Any variable in JSONiq is bound to a sequence of items.
46+
Items can be objects, arrays, or atomic values (strings, integers, booleans, nulls, dates, binary, durations, doubles, decimal numbers, etc).
47+
A sequence of items can be a sequence of just one item, but it can also be empty, or it can be large enough to contain millions, billions or even trillions of items. Obviously, for sequences longer than a billion items, it is a better idea to use a cluster than a laptop.
48+
A relational table (or more generally a data frame) corresponds to a sequence of object items sharing the same schema. However, sequences of items are more general than tables or data frames and support heterogeneity seamlessly.
49+
50+
When passing Python values to JSONiq or getting them from JSONiq queries, the mapping to and from Python is as follows:
4651

4752
| Python | JSONiq |
4853
|-------|-------|
54+
|tuple|sequence of items|
4955
|dict|object|
5056
|list|array|
5157
|str|string|
@@ -73,6 +79,7 @@ You can directly copy paste the code below to a Python file and execute it with
7379

7480
```
7581
from jsoniq import RumbleSession
82+
import pandas as pd
7683
7784
# The syntax to start a session is similar to that of Spark.
7885
# A RumbleSession is a SparkSession that additionally knows about RumbleDB.
@@ -155,16 +162,16 @@ print(seq.json());
155162
###### Binding JSONiq variables to Python values ###########
156163
############################################################
157164
158-
# It is possible to bind a JSONiq variable to a list of native Python values
165+
# It is possible to bind a JSONiq variable to a tuple of native Python values
159166
# and then use it in a query.
160167
# In JSONiq, variables are bound to sequences of items, just like the results of JSONiq
161168
# queries are sequences of items.
162-
# A Python list will be seamlessly converted to a sequence of items by the library.
169+
# A Python tuple will be seamlessly converted to a sequence of items by the library.
163170
# Currently we only support strs, ints, floats, booleans, None, lists, and dicts.
164171
# But if you need more (like date, bytes, etc) we will add them without any problem.
165172
# JSONiq has a rich type system.
166173
167-
rumble.bind('$c', [1,2,3,4, 5, 6])
174+
rumble.bind('$c', (1,2,3,4, 5, 6))
168175
print(rumble.jsoniq("""
169176
for $v in $c
170177
let $parity := $v mod 2
@@ -176,7 +183,7 @@ return { switch($parity)
176183
}
177184
""").json())
178185
179-
rumble.bind('$c', [[1,2,3],[4,5,6]])
186+
rumble.bind('$c', ([1,2,3],[4,5,6]))
180187
print(rumble.jsoniq("""
181188
for $i in $c
182189
return [
@@ -185,18 +192,34 @@ return [
185192
]
186193
""").json())
187194
188-
rumble.bind('$c', [{"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}])
195+
rumble.bind('$c', ({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}))
189196
print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())
190197
191-
# It is possible to bind only one value. The it must be provided as a singleton list.
198+
# It is possible to bind only one value. Then it must be provided as a singleton tuple.
192199
# This is because in JSONiq, an item is the same as a sequence of one item.
193-
rumble.bind('$c', [42])
200+
rumble.bind('$c', (42,))
194201
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
195202
196203
# For convenience and code readability, you can also use bindOne().
197204
rumble.bindOne('$c', 42)
198205
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
199206
207+
##########################################################
208+
##### Binding JSONiq variables to pandas DataFrames ######
209+
##### Getting the output as a Pandas DataFrame ######
210+
##########################################################
211+
212+
# Creating a dummy pandas dataframe
213+
data = {'Name': ['Alice', 'Bob', 'Charlie'],
214+
        'Age': [30, 25, 35]}
215+
pdf = pd.DataFrame(data)
216+
217+
# Binding a pandas dataframe
218+
rumble.bind('$a', pdf)
219+
seq = rumble.jsoniq('$a.Name')
220+
# Getting the output as a pandas dataframe
221+
print(seq.pdf())
222+
200223
201224
################################################
202225
##### Using Pyspark DataFrames with JSONiq #####
@@ -324,6 +347,13 @@ Even more queries can be found [here](https://colab.research.google.com/github/R
324347

325348
# Last updates
326349

350+
## Version 0.1.0 alpha 13
351+
- Allow binding JSONiq variables to pandas dataframes.
352+
- Allow retrieving the output of a JSONiq query as a pandas dataframe (if the output is available as a dataframe, i.e., availableOutputs() returns a list that contains "DataFrame").
353+
- Clean up the mapping to strictly map tuples to sequences of items and lists to array items. This avoids confusion between arrays and sequences.
354+
- As a consequence, json() now returns a tuple, not a list.
355+
- Calling bind() with a single list now raises an informative error. Use bind() with a tuple instead, or call bindOne() to interpret the list as a sequence of one array item.
356+
327357
## Version 0.1.0 alpha 12
328358
- Allow to bind JSONiq variables to Python values (mapping Python lists to sequences of items). This makes it possible to manipulate Python values directly with JSONiq and even without any knowledge of Spark at all.
329359
- renamed bindDataFrameAsVariable() to bind(), which can be used both with DataFrames and Python lists.
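
The README text in this diff says that a data frame corresponds to a sequence of object items sharing the same schema, while sequences in general may be heterogeneous. The sketch below illustrates that idea in plain Python, without Spark, pandas, or the jsoniq package. The helper names `columns_to_sequence` and `shares_schema` are hypothetical and are not part of the library; tuples stand in for sequences and dicts for objects, following the mapping table.

```python
# Column-oriented data, shaped like the dummy dataframe in the README example.
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [30, 25, 35]}


def columns_to_sequence(columns):
    """Turn a dict of equal-length columns into a tuple of row objects.

    Hypothetical helper: the tuple models a JSONiq sequence of items and
    each dict models one object item.
    """
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    return tuple(dict(zip(names, row)) for row in rows)


def shares_schema(sequence):
    """Return True if every item is a dict and all dicts have the same keys."""
    if not all(isinstance(item, dict) for item in sequence):
        return False
    return len({frozenset(item) for item in sequence}) <= 1


people = columns_to_sequence(data)

# A heterogeneous sequence is still a valid sequence of items,
# but it no longer corresponds to a table.
mixed = people + (42, "text", [1, 2])

print(shares_schema(people))  # True
print(shares_schema(mixed))   # False
```

Only the first sequence could be viewed as a table. This is presumably why the changelog entry ties `pdf()` to outputs for which `availableOutputs()` reports "DataFrame".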

pyproject.toml

Lines changed: 5 additions & 2 deletions
@@ -4,11 +4,12 @@ build-backend = "setuptools.build_meta"
44

55
[project]
66
name = "jsoniq"
7-
version = "0.1.0a12"
7+
version = "0.2.0a1"
88
description = "Python edition of RumbleDB, a JSONiq engine"
99
requires-python = ">=3.11"
1010
dependencies = [
11-
"pyspark==4.0"
11+
"pyspark==4.0",
12+
"pandas==2.3"
1213
]
1314
authors = [
1415
{name = "Ghislain Fourny", email = "ghislain.fourny@inf.ethz.ch"},
@@ -23,6 +24,8 @@ classifiers = [
2324
"Programming Language :: Python :: 3.11",
2425
"Programming Language :: Python :: 3.12",
2526
"Programming Language :: Python :: 3.13",
27+
"Typing :: Typed",
28+
"License :: OSI Approved :: Apache Software License"
2629
]
2730

2831
[tool.setuptools.packages.find]

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
pyspark==4.0.0
1+
pyspark==4.0
2+
pandas==2.3

src/jsoniq/sequence.py

Lines changed: 4 additions & 1 deletion
@@ -10,7 +10,7 @@ def __init__(self, sequence, sparksession):
1010
self._sparksession = sparksession
1111

1212
def json(self):
13-
return [json.loads(l.serializeAsJSON()) for l in self._jsequence.items()]
13+
return tuple([json.loads(l.serializeAsJSON()) for l in self._jsequence.items()])
1414

1515
def rdd(self):
1616
rdd = self._jsequence.getAsPickledStringRDD();
@@ -20,6 +20,9 @@ def rdd(self):
2020
def df(self):
2121
return DataFrame(self._jsequence.getAsDataFrame(), self._sparksession)
2222

23+
def pdf(self):
24+
return self.df().toPandas()
25+
2326
def nextJSON(self):
2427
return self._jsequence.next().serializeAsJSON()
2528
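
The change to `json()` above wraps the result in `tuple(...)`, so callers that relied on list behaviour may need adjusting. The sketch below mirrors that wrapping in plain Python. The function `items_to_python` and the sample strings are hypothetical stand-ins for `serializeAsJSON()` output, since the real method needs a running RumbleSession.

```python
import json


def items_to_python(serialized_items):
    """Parse one JSON string per item and return the items as a tuple.

    Hypothetical stand-in for the tuple(...) wrapping in json().
    """
    return tuple(json.loads(text) for text in serialized_items)


# Hypothetical serialized items: an array, an object, and two atomic values.
serialized = ["[1, 2, 3]", '{"foo": [4, 5]}', "42", "null"]
result = items_to_python(serialized)

# The outer container is a tuple (the sequence); arrays inside stay lists.
print(type(result).__name__)     # tuple
print(type(result[0]).__name__)  # list

# Indexing and iteration work as before. Code that needs a mutable list
# can convert explicitly.
as_list = list(result)
as_list.append("extra")
```

Under this model, comparisons such as `result == [...]` that held before the change would now be false, because a tuple never compares equal to a list.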

src/jsoniq/session.py

Lines changed: 21 additions & 6 deletions
@@ -4,6 +4,7 @@
44
import platform
55
import os
66
import re
7+
import pandas as pd
78
import importlib.resources as pkg_resources
89

910
with pkg_resources.path("jsoniq.jars", "rumbledb-1.24.0.jar") as jar_path:
@@ -84,6 +85,8 @@ def __getattr__(self, name):
8485
_builder = Builder()
8586

8687
def convert(self, value):
88+
if isinstance(value, tuple):
89+
return [ self.convert(v) for v in value]
8790
if isinstance(value, bool):
8891
return self._sparksession._jvm.org.rumbledb.items.ItemFactory.getInstance().createBooleanItem(value)
8992
elif isinstance(value, str):
@@ -114,18 +117,30 @@ def bind(self, name: str, valueToBind):
114117
if not name.startswith("$"):
115118
raise ValueError("Variable name must start with a dollar symbol ('$').")
116119
name = name[1:]
117-
if isinstance(valueToBind, list):
118-
items = [ self.convert(value) for value in valueToBind]
119-
conf.setExternalVariableValue(name, items)
120-
return self
121-
if(hasattr(valueToBind, "_get_object_id")):
120+
if isinstance(valueToBind, SequenceOfItems):
121+
outputs = valueToBind.availableOutputs()
122+
if isinstance(outputs, list) and "DataFrame" in outputs:
123+
conf.setExternalVariableValue(name, valueToBind.df());
124+
# TODO support binding a variable to an RDD
125+
#elif isinstance(outputs, list) and "RDD" in outputs:
126+
# conf.setExternalVariableValue(name, valueToBind.getAsRDD());
127+
else:
128+
conf.setExternalVariableValue(name, valueToBind.items());
129+
elif isinstance(valueToBind, pd.DataFrame):
130+
pysparkdf = self._sparksession.createDataFrame(valueToBind)
131+
conf.setExternalVariableValue(name, pysparkdf._jdf);
132+
elif isinstance(valueToBind, tuple):
133+
conf.setExternalVariableValue(name, self.convert(valueToBind))
134+
elif isinstance(valueToBind, list):
135+
raise ValueError("To avoid confusion, a sequence of items must be provided as a Python tuple, not as a Python list. Lists are mapped to single array items, while tuples are mapped to sequences of items. If you want to interpret the list as a sequence of items (one item for each list member), then you need to change this list to a tuple by wrapping it into a tuple() call. If you want to bind the variable to one array item, then you need to wrap the provided list inside a singleton tuple and try again, or you can also call bindOne() instead.")
136+
elif(hasattr(valueToBind, "_get_object_id")):
122137
conf.setExternalVariableValue(name, valueToBind);
123138
else:
124139
conf.setExternalVariableValue(name, valueToBind._jdf);
125140
return self;
126141

127142
def bindOne(self, name: str, value):
128-
return self.bind(name, [value])
143+
return self.bind(name, (value,))
129144

130145
def bindDataFrameAsVariable(self, name: str, df):
131146
conf = self._jrumblesession.getConfiguration();
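
The new branches in `bind()` and `bindOne()` treat a tuple as a sequence of items, reject a bare list, and wrap a single value in a singleton tuple. The sketch below models only that dispatch in plain Python. `item_kind`, `bind_model`, and `bind_one_model` are hypothetical names, and the model returns the kind of each item instead of calling into the JVM.

```python
def item_kind(value):
    """Classify a Python value as a JSONiq item kind (object, array, atomic)."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if value is None or isinstance(value, (bool, int, float, str)):
        return "atomic"
    raise TypeError(f"unsupported value in this model: {value!r}")


def bind_model(value):
    """Return the kinds of the items a variable would be bound to.

    Hypothetical model of the tuple and list branches only. DataFrames
    and SequenceOfItems are outside its scope.
    """
    if isinstance(value, tuple):
        return [item_kind(member) for member in value]
    if isinstance(value, list):
        raise ValueError(
            "pass a tuple for a sequence of items, "
            "or use bind_one_model() for one array item"
        )
    raise TypeError("this model only covers tuples and lists")


def bind_one_model(value):
    """Bind exactly one item by wrapping it in a singleton tuple."""
    return bind_model((value,))


print(bind_model((1, 2, 3)))               # three atomic items
print(bind_model(([1, 2, 3], [4, 5, 6])))  # two array items
print(bind_one_model([1, 2, 3]))           # one array item
```

Rejecting a bare list instead of guessing keeps the two readings of `[1, 2, 3]` explicit: three items, or one array item.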
