Commit 389dd4a

dev sep 25 (#15)
* Bug fixes:
  - Property dictionary access
  - DOR reading with the epc.as_dor() function
  - set_attribute_from_path: handle a list parent
* New:
  - Epc/Object validations have been improved.
  - New function to ease uploading data arrays to an ETP server (returns the proxy URI or the URI of the object itself): energyml.utils.data.datasets_io.get_proxy_uri_for_path_in_external(...)
  - Regex optimisation through precompiled patterns
  - New class for huge files: EpcStreamReader
1 parent f54bfab commit 389dd4a

17 files changed

Lines changed: 1995 additions & 388 deletions

energyml-utils/.gitignore

Lines changed: 2 additions & 1 deletion
@@ -57,4 +57,5 @@ manip*
 
 
 # WIP
-src/energyml/utils/wip*
+src/energyml/utils/wip*
+scripts
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# .pre-commit-config.yaml
+repos:
+  - repo: https://github.com/psf/black
+    rev: 23.1.0
+    hooks:
+      - id: black
+  - repo: https://github.com/pycqa/isort
+    rev: 5.12.0
+    hooks:
+      - id: isort
+  - repo: https://github.com/pycqa/flake8
+    rev: 6.0.0
+    hooks:
+      - id: flake8

energyml-utils/README.md

Lines changed: 149 additions & 4 deletions
@@ -76,6 +76,144 @@ energyml-prodml2-2 = "^1.12.0"
 - The "EnergymlWorkspace" class allows to abstract the access of numerical data like "ExternalArrays". This class can thus be extended to interact with ETP "GetDataArray" request etc...
 - ETP URI support : the "Uri" class allows to parse/write an etp uri.
 
+## EPC Stream Reader
+
+The **EpcStreamReader** provides memory-efficient handling of large EPC files through lazy loading and smart caching. Unlike the standard `Epc` class which loads all objects into memory, the stream reader loads objects on-demand, making it ideal for handling very large EPC files with thousands of objects.
+
+### Key Features
+
+- **Lazy Loading**: Objects are loaded only when accessed, reducing memory footprint
+- **Smart Caching**: LRU (Least Recently Used) cache with configurable size
+- **Automatic EPC Version Detection**: Supports both CLASSIC and EXPANDED EPC formats
+- **Add/Remove/Update Operations**: Full CRUD operations with automatic file structure maintenance
+- **Context Management**: Automatic resource cleanup with `with` statements
+- **Memory Monitoring**: Track cache efficiency and memory usage statistics
+
+### Basic Usage
+
+```python
+from energyml.utils.epc_stream import EpcStreamReader
+
+# Open EPC file with context manager (recommended)
+with EpcStreamReader('large_file.epc', cache_size=50) as reader:
+    # List all objects without loading them
+    print(f"Total objects: {reader.stats.total_objects}")
+
+    # Get object by identifier
+    obj: Any = reader.get_object_by_identifier("uuid.version")
+
+    # Get objects by type
+    features: List[Any] = reader.get_objects_by_type("BoundaryFeature")
+
+    # Get all objects with same UUID
+    versions: List[Any] = reader.get_object_by_uuid("12345678-1234-1234-1234-123456789abc")
+```
+
+### Adding Objects
+
+```python
+from energyml.utils.epc_stream import EpcStreamReader
+from energyml.utils.constants import gen_uuid
+import energyml.resqml.v2_2.resqmlv2 as resqml
+import energyml.eml.v2_3.commonv2 as eml
+
+# Create a new EnergyML object
+boundary_feature = resqml.BoundaryFeature()
+boundary_feature.uuid = gen_uuid()
+boundary_feature.citation = eml.Citation(title="My Feature")
+
+with EpcStreamReader('my_file.epc') as reader:
+    # Add object - path is automatically generated based on EPC version
+    identifier = reader.add_object(boundary_feature)
+    print(f"Added object with identifier: {identifier}")
+
+    # Or specify custom path (optional)
+    identifier = reader.add_object(boundary_feature, "custom/path/MyFeature.xml")
+```
+
+### Removing Objects
+
+```python
+with EpcStreamReader('my_file.epc') as reader:
+    # Remove specific version by full identifier
+    success = reader.remove_object("uuid.version")
+
+    # Remove ALL versions by UUID only
+    success = reader.remove_object("12345678-1234-1234-1234-123456789abc")
+
+    if success:
+        print("Object(s) removed successfully")
+```
+
+### Updating Objects
+
+```python
+...
+from energyml.utils.introspection import set_attribute_from_path
+
+with EpcStreamReader('my_file.epc') as reader:
+    # Get existing object
+    obj = reader.get_object_by_identifier("uuid.version")
+
+    # Modify the object
+    set_attribute_from_path(obj, "citation.title", "Updated Title")
+
+    # Update in EPC file
+    new_identifier = reader.update_object(obj)
+    print(f"Updated object: {new_identifier}")
+```
+
+### Performance Monitoring
+
+```python
+with EpcStreamReader('large_file.epc', cache_size=100) as reader:
+    # Access some objects...
+    for i in range(10):
+        obj = reader.get_object_by_identifier(f"uuid-{i}.1")
+
+    # Check performance statistics
+    print(f"Cache hit rate: {reader.stats.cache_hit_rate:.1f}%")
+    print(f"Memory efficiency: {reader.stats.memory_efficiency:.1f}%")
+    print(f"Objects in cache: {reader.stats.loaded_objects}/{reader.stats.total_objects}")
+```
+
+### EPC Version Support
+
+The EpcStreamReader automatically detects and handles both EPC packaging formats:
+
+- **CLASSIC Format**: Flat file structure (e.g., `obj_BoundaryFeature_{uuid}.xml`)
+- **EXPANDED Format**: Namespace structure (e.g., `namespace_resqml201/version_{id}/obj_BoundaryFeature_{uuid}.xml` or `namespace_resqml201/obj_BoundaryFeature_{uuid}.xml`)
+
+```python
+with EpcStreamReader('my_file.epc') as reader:
+    print(f"Detected EPC version: {reader.export_version}")
+    # Objects added will use the same format as the existing EPC file
+```
+
+### Advanced Usage
+
+```python
+# Initialize without preloading metadata for faster startup
+reader = EpcStreamReader('huge_file.epc', preload_metadata=False, cache_size=200)
+
+try:
+    # Manual metadata loading when needed
+    reader._load_metadata()
+
+    # Get object dependencies
+    deps = reader.get_object_dependencies("uuid.version")
+
+    # Batch processing with memory monitoring
+    for obj_type in ["BoundaryFeature", "PropertyKind"]:
+        objects = reader.get_objects_by_type(obj_type)
+        print(f"Processing {len(objects)} {obj_type} objects")
+
+finally:
+    reader.close()  # Manual cleanup if not using context manager
+```
+
+The EpcStreamReader is perfect for applications that need to work with large EPC files efficiently, such as data processing pipelines, web applications, or analysis tools where memory usage is a concern.
+
 
 # Poetry scripts :
 
@@ -95,25 +233,32 @@ energyml-prodml2-2 = "^1.12.0"
 poetry install
 ```
 
+if you fail to run a script, you may have to add "src" to your PYTHONPATH environment variable. For example, in powershell :
+
+```powershell
+$env:PYTHONPATH="src"
+```
+
 
 ## Validation examples :
 
 An epc file:
 ```bash
-poetry run validate --input "path/to/your/energyml/object.epc" *> output_logs.json
+poetry run validate --file "path/to/your/energyml/object.epc" *> output_logs.json
 ```
 
 An xml file:
 ```bash
-poetry run validate --input "path/to/your/energyml/object.xml" *> output_logs.json
+poetry run validate --file "path/to/your/energyml/object.xml" *> output_logs.json
 ```
 
 A json file:
 ```bash
-poetry run validate --input "path/to/your/energyml/object.json" *> output_logs.json
+poetry run validate --file "path/to/your/energyml/object.json" *> output_logs.json
 ```
 
 A folder containing Epc/xml/json files:
 ```bash
-poetry run validate --input "path/to/your/folder" *> output_logs.json
+poetry run validate --file "path/to/your/folder" *> output_logs.json
 ```
+
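
As a rough illustration of the lazy loading and LRU caching that the README section describes, the sketch below reads members of a ZIP archive on demand and keeps only the most recently used ones in memory. It relies on EPC files being ZIP-based packages. `LazyZipReader`, `build_demo_archive`, `cached_names`, and the `hits`/`misses` counters are hypothetical names invented for this example; they are not part of energyml-utils and do not describe how `EpcStreamReader` is actually implemented.

```python
import io
import zipfile
from collections import OrderedDict
from typing import BinaryIO, List, Union


class LazyZipReader:
    """Hypothetical sketch: on-demand member reads with a small LRU cache."""

    def __init__(self, source: Union[str, BinaryIO], cache_size: int = 2) -> None:
        # Opening the archive parses only the central directory, not the members.
        self._zip = zipfile.ZipFile(source)
        self._cache: "OrderedDict[str, bytes]" = OrderedDict()
        self._cache_size = cache_size
        self.hits = 0
        self.misses = 0

    def names(self) -> List[str]:
        """List member names without decompressing any member."""
        return self._zip.namelist()

    def cached_names(self) -> List[str]:
        """Names currently held in the cache, least recently used first."""
        return list(self._cache)

    def read(self, name: str) -> bytes:
        if name in self._cache:
            self.hits += 1
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        self.misses += 1
        data = self._zip.read(name)  # decompress this member only
        self._cache[name] = data
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data

    def close(self) -> None:
        self._zip.close()

    def __enter__(self) -> "LazyZipReader":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


def build_demo_archive() -> io.BytesIO:
    """Create a tiny in-memory archive standing in for an EPC file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for i in range(3):
            archive.writestr(f"obj_{i}.xml", f"<obj id='{i}'/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    with LazyZipReader(build_demo_archive(), cache_size=2) as reader:
        for member in ["obj_0.xml", "obj_1.xml", "obj_0.xml", "obj_2.xml"]:
            reader.read(member)
        print(reader.cached_names(), reader.hits, reader.misses)
```

With a cache size of 2, the fourth read evicts `obj_1.xml` because `obj_0.xml` was touched more recently. A real reader would also need to parse the XML and track relationships, which this sketch leaves out.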

energyml-utils/example/tools.py

Lines changed: 7 additions & 7 deletions
@@ -6,13 +6,13 @@
 import pathlib
 from typing import Optional, List, Dict, Any
 
-from src.energyml.utils.validation import validate_epc
+from energyml.utils.validation import validate_epc
 
-from src.energyml.utils.constants import get_property_kind_dict_path_as_xml
-from src.energyml.utils.data.datasets_io import CSVFileReader, HDF5FileWriter, ParquetFileWriter, DATFileReader
-from src.energyml.utils.data.mesh import MeshFileFormat, export_multiple_data, export_obj, read_mesh_object
-from src.energyml.utils.epc import Epc, gen_energyml_object_path
-from src.energyml.utils.introspection import (
+from energyml.utils.constants import get_property_kind_dict_path_as_xml
+from energyml.utils.data.datasets_io import CSVFileReader, HDF5FileWriter, ParquetFileWriter, DATFileReader
+from energyml.utils.data.mesh import MeshFileFormat, export_multiple_data, export_obj, read_mesh_object
+from energyml.utils.epc import Epc, gen_energyml_object_path
+from energyml.utils.introspection import (
     get_class_from_simple_name,
     get_module_name_and_type_from_content_or_qualified_type,
     random_value_from_class,
@@ -27,7 +27,7 @@
     get_class_from_qualified_type,
     get_object_attribute_or_create,
 )
-from src.energyml.utils.serialization import (
+from energyml.utils.serialization import (
     serialize_json,
     JSON_VERSION,
     serialize_xml,

energyml-utils/pyproject.toml

Lines changed: 13 additions & 3 deletions
@@ -46,8 +46,12 @@ include = [
 # "src/energyml/main.py"
 #]
 
-#[tool.pytest.ini_options]
-#pythonpath = [ "src" ]
+[tool.pytest.ini_options]
+pythonpath = [ "src" ]
+testpaths = [ "tests" ]
+python_files = [ "test_*.py", "*_test.py" ]
+python_classes = [ "Test*" ]
+python_functions = [ "test_*" ]
 
 [tool.poetry.extras]
 parquet = ["pyarrow", "numpy", "pandas"]
@@ -61,7 +65,7 @@ h5py = { version = "^3.7.0", optional = false }
 pyarrow = { version = "^14.0.1", optional = false }
 numpy = { version = "^1.16.6", optional = false }
 
-[poetry.group.dev.dependencies]
+[tool.poetry.group.dev.dependencies]
 pandas = { version = "^1.1.0", optional = false }
 coverage = {extras = ["toml"], version = "^6.2"}
 pytest = "^8.1.1"
@@ -83,6 +87,12 @@ energyml-witsml2-1 = "^1.12.0"
 energyml-prodml2-0 = "^1.12.0"
 energyml-prodml2-2 = "^1.12.0"
 
+mypy = "^0.971"
+bandit = "^1.7.0"
+safety = "^1.10.0"
+memory-profiler = "^0.60.0"
+line-profiler = "^4.0.0"
+
 [tool.coverage.run]
 branch = true
 source = ["src/energyml"]
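
The `python_files` patterns enabled in the `[tool.pytest.ini_options]` table are glob-style, so their effect can be approximated with the standard `fnmatch` module. The sketch below is only an approximation of pytest's file selection under that assumption; `collected_files` and the candidate file names are hypothetical and do not come from this repository.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Patterns copied from the python_files setting in this diff.
PYTHON_FILES = ["test_*.py", "*_test.py"]


def collected_files(filenames: Iterable[str], patterns: Iterable[str] = PYTHON_FILES) -> List[str]:
    """Return the file names matching at least one glob pattern."""
    pattern_list = list(patterns)
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in pattern_list)
    ]


if __name__ == "__main__":
    # Hypothetical file names, chosen only to exercise both patterns.
    candidates = ["test_epc.py", "uri_test.py", "helpers.py", "conftest.py"]
    print(collected_files(candidates))
```

Under these patterns `helpers.py` and `conftest.py` are not treated as test modules, although pytest still loads `conftest.py` files through a separate mechanism.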
98 KB
Binary file not shown.
Lines changed: 3 additions & 0 deletions
@@ -1,2 +1,5 @@
 # Copyright (c) 2023-2024 Geosiris.
 # SPDX-License-Identifier: Apache-2.0
+
+# This is a namespace package
+__path__ = __import__("pkgutil").extend_path(__path__, __name__)
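
To show what the `pkgutil.extend_path` line makes possible, the sketch below builds two temporary directories that each contain a portion of the same package and imports a submodule from each. The package name `demo_ns`, the directory names, and the helper functions are hypothetical and exist only for this demonstration; how the energyml packages are actually distributed is not shown in this diff.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple

# Same statement as the one added in this diff.
INIT_SOURCE = '__path__ = __import__("pkgutil").extend_path(__path__, __name__)\n'


def build_portion(root: Path, module_name: str, value: str) -> None:
    """Create root/demo_ns/ with an __init__.py and a single submodule."""
    package_dir = root / "demo_ns"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(INIT_SOURCE)
    (package_dir / f"{module_name}.py").write_text(f"VALUE = {value!r}\n")


def load_both_portions() -> Tuple[str, str]:
    """Import demo_ns.alpha and demo_ns.beta from two separate directories."""
    with tempfile.TemporaryDirectory() as tmp:
        first, second = Path(tmp, "dist_a"), Path(tmp, "dist_b")
        build_portion(first, "alpha", "from dist_a")
        build_portion(second, "beta", "from dist_b")
        sys.path[:0] = [str(first), str(second)]
        importlib.invalidate_caches()
        try:
            # extend_path adds every sys.path entry containing demo_ns/ to the
            # package __path__, so both submodules resolve.
            alpha = importlib.import_module("demo_ns.alpha")
            beta = importlib.import_module("demo_ns.beta")
            return alpha.VALUE, beta.VALUE
        finally:
            # Undo the changes made to the interpreter state.
            for entry in (str(first), str(second)):
                sys.path.remove(entry)
            for name in [n for n in sys.modules if n == "demo_ns" or n.startswith("demo_ns.")]:
                del sys.modules[name]


if __name__ == "__main__":
    print(load_both_portions())
```

Without the `extend_path` line, only the first `demo_ns` directory found on `sys.path` would be searched and the import of `demo_ns.beta` would fail.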
