Commit 34ca92a

[sc-198001] Fix out of memory while exporting large datasets + cache optimizations (#12)
* [sc-198001] Fix out of memory while exporting large datasets + cache optimizations
* [sc-198001] Remove useless cache border check
* [sc-198001] PR review fixes
* Fix linter issues
* Optimization: use openpyxl in write only mode with lxml
1 parent 1ffeae7 commit 34ca92a

7 files changed

Lines changed: 275 additions & 72 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
 # Changelog
 
+## [Version 2.1.0](https://github.com/dataiku/dss-plugin-multisheet-excel-export/releases/tag/v2.1.0) - Major release - 2024-09
+- Bug fix: one temporary workbook is used per dataset to avoid out of memory issues while exporting large datasets. All these temporary workbooks are merged at the end to generate the final excel file
+- Optimizations: using of a cache for styles to avoid useless copies + openpyxl write only mode with lxml
+
 ## [Version 2.0.0](https://github.com/dataiku/dss-plugin-multisheet-excel-export/releases/tag/v2.0.0) - Major release - 2024-07
 - Important : Column type changed ! From this version, cell types in excel will reflect the storage type in DSS. For example, string column containing only numbers will be exported as text column. If you want a number column in excel, you need to have a integer/float column on DSS
 - Export dataset conditional formatting colors (colors the cells, does not export rules)
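
The "cache for styles" optimization named in the 2.1.0 entry can be pictured with a small sketch. The plugin's actual cache is not shown in this diff, so everything below is an assumption for illustration: `CellStyle` and `StyleCache` are hypothetical names, and `CellStyle` is a plain stand-in, not an openpyxl class. The idea is only that cells sharing the same formatting reuse one style object instead of each receiving its own copy.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CellStyle:
    """Hypothetical stand-in for a spreadsheet style object (not an openpyxl class)."""
    font_name: str
    bold: bool
    fill_color: str


class StyleCache:
    """Hand out one shared style object per distinct combination of attributes."""

    def __init__(self) -> None:
        self._styles: Dict[Tuple[str, bool, str], CellStyle] = {}
        self.misses = 0  # number of style objects actually built

    def get(self, font_name: str, bold: bool, fill_color: str) -> CellStyle:
        key = (font_name, bold, fill_color)
        style = self._styles.get(key)
        if style is None:
            # First time this combination is seen: build it once and remember it.
            style = CellStyle(font_name, bold, fill_color)
            self._styles[key] = style
            self.misses += 1
        return style


cache = StyleCache()
header = cache.get("Calibri", True, "FFFFFF")
# Ten thousand cells with identical formatting all share a single object.
styles = [cache.get("Calibri", False, "FFDDDD") for _ in range(10_000)]
```

With this shape, the number of style objects grows with the number of distinct formats rather than with the number of cells, which is the kind of saving the changelog entry describes.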

README.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ This plugin relies on the [openpyxl](https://openpyxl.readthedocs.io/en/stable/)
 
 Once the plugin is successfully installed, select the datasets that you want to export as one excel file.
 Then run the Multi-Sheet Excel Export recipe from the flow.
-It will create a folder in your flow containing the output `.xls` file. Each sheet of this file contains one dataset and is named after this dataset.
+It will create a folder in your flow containing the output `.xlsx` file. Each sheet of this file contains one dataset and is named after this dataset.
 
 ## Running tests
 
@@ -24,4 +24,4 @@ In order to run the tests contained in `python-test\`, launch the following comm
 
 Copyright 2020-2022 Dataiku SAS
 
-This plugin is distributed under the Apache License version 2.0
+This plugin is distributed under the Apache License version 2.0
Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 openpyxl==3.0.6
 pathvalidate==2.3.0
+lxml==5.3.0

custom-recipes/to-excel/recipe.py

Lines changed: 8 additions & 7 deletions
@@ -18,22 +18,23 @@
 from typing import Union
 
 DEFAULT_DATAIKU_SHEET_NAME = "Sheet1"
-READ_CHUNK_SIZE = 1024 * 1024 # 1Mbytes
+READ_CHUNK_SIZE = 1024 * 1024 # 1Mbytes
+
 
 def get_excel_worksheet(dataset: dataiku.Dataset, apply_conditional_formatting: bool) -> Union[Workbook, None]:
-    logger.info(f"Getting Excel workbook from DSS dataset {dataset.short_name}")
+    logger.info(f"Getting Excel workbook from DSS dataset '{dataset.short_name}'...")
     workbook = None
     with tempfile.NamedTemporaryFile(delete=True) as tmp_file:
-        with dataset.raw_formatted_data(format="excel", format_params={ "applyColoring": apply_conditional_formatting }) as stream:
+        with dataset.raw_formatted_data(format="excel", format_params={"applyColoring": apply_conditional_formatting}) as stream:
             # read steam with chunks to save RAM
             chunk_size = READ_CHUNK_SIZE
             while True:
                 chunk = stream.read(chunk_size)
                 if not chunk:
                     break
                 tmp_file.write(chunk)
-        tmp_file.flush() # Make sure file is written on disk
-        tmp_file.seek(0) # Read back from start of file to load it in the workbook
+        tmp_file.flush() # Make sure file is written on disk
+        tmp_file.seek(0) # Read back from start of file to load it in the workbook
 
         # DEV WARNING : Excel exported file contains header row in Calibri and rest in Aptos Narrow font. But load_workbook converts everything into Calibri
         workbook = load_workbook(tmp_file)
@@ -42,10 +43,10 @@ def get_excel_worksheet(dataset: dataiku.Dataset, apply_conditional_formatting:
     if DEFAULT_DATAIKU_SHEET_NAME in workbook:
         return workbook[DEFAULT_DATAIKU_SHEET_NAME]
     elif len(workbook.sheetnames) == 1:
-        logger.warn(f"Default DSS default sheet name has changed from {DEFAULT_DATAIKU_SHEET_NAME} to {workbook.sheetnames[0]}")
+        logger.warning(f"Default DSS default sheet name has changed from '{DEFAULT_DATAIKU_SHEET_NAME}' to '{workbook.sheetnames[0]}'")
         return workbook[workbook.sheetnames[0]]
 
-    logger.error("Error getting Excel workbook from DSS dataset {dataset.short_name}, this dataset will not be exported")
+    logger.error(f"Error getting Excel workbook from DSS dataset '{dataset.short_name}', this dataset will not be exported")
     return None
 
 
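
The chunked read in `get_excel_worksheet` can be tried outside DSS with a short, standalone sketch. It is not the plugin's code: `copy_stream_in_chunks` is a hypothetical helper name, and an in-memory `io.BytesIO` stands in for the stream returned by `dataset.raw_formatted_data`. Only the loop shape and the `READ_CHUNK_SIZE` value are taken from the diff above.

```python
import io
import tempfile
from typing import BinaryIO

READ_CHUNK_SIZE = 1024 * 1024  # 1 MiB, same value as in the diff


def copy_stream_in_chunks(stream: BinaryIO, target: BinaryIO,
                          chunk_size: int = READ_CHUNK_SIZE) -> int:
    """Copy `stream` into `target` one chunk at a time and return the byte count.

    At most `chunk_size` bytes are held in memory at once, so the size of the
    export does not drive peak RAM usage.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        target.write(chunk)
        total += len(chunk)
    target.flush()  # make sure buffered data reaches the file
    target.seek(0)  # rewind so the caller can read from the start
    return total


# Stand-in for the DSS export stream: a little over two chunks of data.
payload = b"x" * (2 * READ_CHUNK_SIZE + 123)
with tempfile.TemporaryFile() as tmp_file:
    copied = copy_stream_in_chunks(io.BytesIO(payload), tmp_file)
    round_trip = tmp_file.read()
```

The standard library's `shutil.copyfileobj(fsrc, fdst, length)` performs the same loop; the explicit version is shown here because it mirrors the code in the diff.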

plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
     "id" : "multisheet-excel-export",
-    "version" : "2.0.0",
+    "version" : "2.1.0",
 
 
     "meta" : {
