Commit 0f8b7f0

Extension interface (#647)
* draft extensions interface
* Change to new advised style of defining abstract base class.
* incorporate @pgijbers' feedback
* incorporate Jan's comments
* (hopefully) make the tests run again
* make more tests work again
* fix more tests?
* Move all files for the sklearn converter to a single location
* fix tests
* TST fix function call
* slight reorganization of the files
* TST fix wrong path
* TST fix wrong path
* MAINT add type hints to all methods touched in this PR
* factor a lot of extension functions to new file
* fix a few broken tests
* rename test files to reflect previous refactor
* fix unit tests
* fix unit tests
* add extension plugin mechanism
* pep8 & mypy
* save docstring progress
* fix?
* finish docstrings & simplify interface
* add extension interface to documentation
* PEP8 & doc building
* Address comments by Jan and Pieter
* progress dump
* tests, pep8, shuffle functions and tests around
1 parent 7ec429e commit 0f8b7f0

36 files changed

Lines changed: 3177 additions & 2275 deletions

ci_scripts/flake8_diff.sh

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 #!/bin/bash

 flake8 --ignore E402,W503 --show-source --max-line-length 100 $options
+mypy openml --ignore-missing-imports --follow-imports skip

ci_scripts/install.sh

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ if [[ "$COVERAGE" == "true" ]]; then
 pip install codecov pytest-cov
 fi
 if [[ "$RUN_FLAKE8" == "true" ]]; then
-pip install flake8
+pip install flake8 mypy
 fi

 python --version

doc/api.rst

Lines changed: 26 additions & 4 deletions
@@ -20,6 +20,32 @@ Top-level Classes
 OpenMLFlow
 OpenMLEvaluation

+.. _api_extensions:
+
+Extensions
+----------
+
+.. currentmodule:: openml.extensions
+
+.. autosummary::
+:toctree: generated/
+:template: class.rst
+
+Extension
+sklearn.SklearnExtension
+
+.. currentmodule:: openml.extensions
+
+.. autosummary::
+:toctree: generated/
+:template: function.rst
+
+register_extension
+get_extension_by_model
+get_extension_by_flow
+
+Modules
+-------

 :mod:`openml.datasets`: Dataset Functions
 -----------------------------------------
@@ -55,10 +81,8 @@ Top-level Classes
 :template: function.rst

 flow_exists
-flow_to_sklearn
 get_flow
 list_flows
-sklearn_to_flow

 :mod:`openml.runs`: Run Functions
 ----------------------------------
@@ -112,5 +136,3 @@ Top-level Classes
 get_tasks
 list_tasks

-
-
doc/contributing.rst

Lines changed: 2 additions & 13 deletions
@@ -106,17 +106,13 @@ From within the directory of the cloned package, execute:

 pytest tests/

-.. _extending:
-
-Executing a specific test can be done by specifying the module, test case, and test.
+Executing a specific test can be done by specifying the module, test case, and test.
 To obtain a hierarchical list of all tests, run

 .. code:: bash

 pytest --collect-only

-.. _extending:
-
 .. code:: bash

 <Module 'tests/test_datasets/test_dataset.py'>
@@ -129,33 +125,26 @@ To obtain a hierarchical list of all tests, run
 <TestCaseFunction 'test_get_data_with_target'>
 <UnitTestCase 'OpenMLDatasetTestOnTestServer'>
 <TestCaseFunction 'test_tagging'>
-
-.. _extending:
+

 To run a specific module, add the module name, for instance:

 .. code:: bash

 pytest tests/test_datasets/test_dataset.py

-.. _extending:
-
 To run a specific unit test case, add the test case name, for instance:

 .. code:: bash

 pytest tests/test_datasets/test_dataset.py::OpenMLDatasetTest

-.. _extending:
-
 To run a specific unit test, add the test name, for instance:

 .. code:: bash

 pytest tests/test_datasets/test_dataset.py::OpenMLDatasetTest::test_get_data

-.. _extending:
-
 Happy testing!


doc/usage.rst

Lines changed: 7 additions & 0 deletions
@@ -116,6 +116,13 @@ obtained on. Learn how to share your datasets in the following tutorial:

 * `Upload a dataset <examples/create_upload_tutorial.html>`_

+~~~~~~~~~~~~~~~~~~~~~~~
+Extending OpenML-Python
+~~~~~~~~~~~~~~~~~~~~~~~
+
+OpenML-Python provides an extension interface to connect other machine learning libraries than
+scikit-learn to OpenML. Please check the :ref:`api_extensions` and use the
+scikit-learn extension in :class:`openml.extensions.sklearn.SklearnExtension` as a starting point.

 ~~~~~~~~~~~~~~~
 Advanced topics
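
The extension mechanism mentioned in the `doc/usage.rst` addition can be pictured as a small plugin registry. The following is a minimal standalone sketch, not the code from this commit: the names `register_extension`, `get_extension_by_model`, and `model_to_flow` mirror names that appear in this diff, but their bodies, the `can_handle_model` hook, the dict-based flow, and the `ToyExtension` class are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Type

# Hypothetical module-level registry; the real openml.extensions module may differ.
_extensions: List[Type["Extension"]] = []


class Extension(ABC):
    """Illustrative base class for a library-specific converter."""

    @classmethod
    @abstractmethod
    def can_handle_model(cls, model: Any) -> bool:
        """Return True if this extension knows how to convert ``model``."""

    @abstractmethod
    def model_to_flow(self, model: Any) -> dict:
        """Convert ``model`` into a flow description (a plain dict here)."""


def register_extension(extension: Type[Extension]) -> None:
    """Add an extension class to the registry."""
    _extensions.append(extension)


def get_extension_by_model(model: Any) -> Optional[Extension]:
    """Return an instance of the single extension that accepts ``model``."""
    candidates = [ext for ext in _extensions if ext.can_handle_model(model)]
    if not candidates:
        return None
    if len(candidates) > 1:
        raise ValueError("Multiple extensions can handle this model.")
    return candidates[0]()


class ToyExtension(Extension):
    """Assumed example: treats every dict with a 'kind' key as a model."""

    @classmethod
    def can_handle_model(cls, model: Any) -> bool:
        return isinstance(model, dict) and "kind" in model

    def model_to_flow(self, model: Any) -> dict:
        return {"name": model["kind"], "parameters": dict(model)}


register_extension(ToyExtension)
extension = get_extension_by_model({"kind": "tree", "depth": 3})
flow = extension.model_to_flow({"kind": "tree", "depth": 3})
```

In this sketch a library author would subclass `Extension`, implement the two hooks, and call `register_extension` once at import time; lookup then needs no knowledge of which libraries are installed.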

examples/flows_and_runs_tutorial.py

Lines changed: 19 additions & 8 deletions
@@ -49,11 +49,8 @@
 # Build any classifier or pipeline
 clf = tree.ExtraTreeClassifier()

-# Create a flow
-flow = openml.flows.sklearn_to_flow(clf)
-
 # Run the flow
-run = openml.runs.run_flow_on_task(flow, task)
+run = openml.runs.run_model_on_task(clf, task)

 # pprint(vars(run), depth=2)

@@ -85,9 +82,8 @@
 ('OneHotEncoder', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')),
 ('Classifier', ensemble.RandomForestClassifier())
 ])
-flow = openml.flows.sklearn_to_flow(pipe)

-run = openml.runs.run_flow_on_task(flow, task, avoid_duplicate_runs=False)
+run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)
 myrun = run.publish()
 print("Uploaded to http://test.openml.org/r/" + str(myrun.run_id))

@@ -118,6 +114,22 @@
 # Publishing the run will automatically upload the related flow if
 # it does not yet exist on the server.

+############################################################################
+# Alternatively, one can also directly run flows.
+
+# Get a task
+task = openml.tasks.get_task(403)
+
+# Build any classifier or pipeline
+clf = tree.ExtraTreeClassifier()
+
+# Obtain the scikit-learn extension interface to convert the classifier
+# into a flow object.
+extension = openml.extensions.get_extension_by_model(clf)
+flow = extension.model_to_flow(clf)
+
+run = openml.runs.run_flow_on_task(flow, task)
+
 ############################################################################
 # Challenge
 # ^^^^^^^^^
@@ -142,8 +154,7 @@
 task = openml.tasks.get_task(task_id)
 data = openml.datasets.get_dataset(task.dataset_id)
 clf = neighbors.KNeighborsClassifier(n_neighbors=5)
-flow = openml.flows.sklearn_to_flow(clf)

-run = openml.runs.run_flow_on_task(flow, task, avoid_duplicate_runs=False)
+run = openml.runs.run_model_on_task(clf, task, avoid_duplicate_runs=False)
 myrun = run.publish()
 print("kNN on %s: http://test.openml.org/r/%d" % (data.name, myrun.run_id))
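
The tutorial changes in this file replace the two-step pattern (`sklearn_to_flow`, then `run_flow_on_task`) with a single `run_model_on_task` call. One plausible way to picture that relationship is a thin wrapper that converts the model and then delegates. The sketch below is standalone and uses stubs: `model_to_flow`, `run_flow_on_task`, and `run_model_on_task` here are hypothetical stand-ins that share only their names with the OpenML API, and the dictionaries they return, the string task id, and `DummyClassifier` are invented for illustration.

```python
from typing import Any, Dict


def model_to_flow(model: Any) -> Dict[str, Any]:
    """Stub converter: describe a model by its class name (assumption)."""
    return {"flow_name": type(model).__name__}


def run_flow_on_task(flow: Dict[str, Any], task: str,
                     avoid_duplicate_runs: bool = True) -> Dict[str, Any]:
    """Stub runner: record what would be executed instead of running it."""
    return {"flow": flow, "task": task, "avoid_duplicate_runs": avoid_duplicate_runs}


def run_model_on_task(model: Any, task: str,
                      avoid_duplicate_runs: bool = True) -> Dict[str, Any]:
    """Convenience wrapper: convert the model, then delegate to the flow runner."""
    flow = model_to_flow(model)
    return run_flow_on_task(flow, task, avoid_duplicate_runs=avoid_duplicate_runs)


class DummyClassifier:
    """Placeholder for an estimator such as tree.ExtraTreeClassifier()."""


clf = DummyClassifier()
one_step = run_model_on_task(clf, "task-403", avoid_duplicate_runs=False)
two_step = run_flow_on_task(model_to_flow(clf), "task-403", avoid_duplicate_runs=False)
```

Under these assumptions both call styles produce the same record, which is why the tutorial can drop the explicit flow-creation step while still offering the flow-based route as an alternative.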

examples/introduction_tutorial.py

Lines changed: 1 addition & 2 deletions
@@ -77,8 +77,7 @@
 task = openml.tasks.get_task(403)
 data = openml.datasets.get_dataset(task.dataset_id)
 clf = neighbors.KNeighborsClassifier(n_neighbors=5)
-flow = openml.flows.sklearn_to_flow(clf)
-run = openml.runs.run_flow_on_task(flow, task, avoid_duplicate_runs=False)
+run = openml.runs.run_model_on_task(clf, task, avoid_duplicate_runs=False)
 # Publish the experiment on OpenML (optional, requires an API key).
 # For this tutorial, our configuration publishes to the test server
 # as to not pollute the main server.

openml/__init__.py

Lines changed: 53 additions & 12 deletions
@@ -14,23 +14,36 @@
 (`REST on wikipedia
 <http://en.wikipedia.org/wiki/Representational_state_transfer>`_).
 """
-from . import config

+from . import _api_calls
+from . import config
 from .datasets import OpenMLDataset, OpenMLDataFeature
 from . import datasets
+from . import evaluations
+from .evaluations import OpenMLEvaluation
+from . import extensions
+from . import exceptions
 from . import tasks
+from .tasks import (
+OpenMLTask,
+OpenMLSplit,
+OpenMLSupervisedTask,
+OpenMLClassificationTask,
+OpenMLRegressionTask,
+OpenMLClusteringTask,
+OpenMLLearningCurveTask,
+)
 from . import runs
-from . import flows
-from . import setups
-from . import evaluations
-
 from .runs import OpenMLRun
-from .tasks import OpenMLTask, OpenMLSplit
+from . import flows
 from .flows import OpenMLFlow
-from .evaluations import OpenMLEvaluation
+from . import setups
+from . import study
 from .study import OpenMLStudy
+from . import utils
+

-from .__version__ import __version__ # noqa: F401
+from .__version__ import __version__


 def populate_cache(task_ids=None, dataset_ids=None, flow_ids=None,
@@ -69,7 +82,35 @@ def populate_cache(task_ids=None, dataset_ids=None, flow_ids=None,
 runs.functions.get_run(run_id)


-__all__ = ['OpenMLDataset', 'OpenMLDataFeature', 'OpenMLRun',
-'OpenMLSplit', 'OpenMLEvaluation', 'OpenMLSetup',
-'OpenMLTask', 'OpenMLFlow', 'OpenMLStudy', 'datasets',
-'evaluations', 'config', 'runs', 'flows', 'tasks', 'setups']
+__all__ = [
+'OpenMLDataset',
+'OpenMLDataFeature',
+'OpenMLRun',
+'OpenMLSplit',
+'OpenMLEvaluation',
+'OpenMLSetup',
+'OpenMLTask',
+'OpenMLSupervisedTask',
+'OpenMLClusteringTask',
+'OpenMLLearningCurveTask',
+'OpenMLRegressionTask',
+'OpenMLClassificationTask',
+'OpenMLFlow',
+'OpenMLStudy',
+'datasets',
+'evaluations',
+'exceptions',
+'extensions',
+'config',
+'runs',
+'flows',
+'tasks',
+'setups',
+'study',
+'utils',
+'_api_calls',
+'__version__',
+]
+
+# Load the scikit-learn extension by default
+import openml.extensions.sklearn # noqa: F401

openml/config.py

Lines changed: 5 additions & 4 deletions
@@ -28,13 +28,14 @@

 # Default values are actually added here in the _setup() function which is
 # called at the end of this module
-server = ""
-apikey = ""
+server = _defaults['server']
+apikey = _defaults['apikey']
 # The current cache directory (without the server name)
-cache_directory = ""
+cache_directory = _defaults['cachedir']
+avoid_duplicate_runs = True if _defaults['avoid_duplicate_runs'] == 'True' else False

 # Number of retries if the connection breaks
-connection_n_retries = 2
+connection_n_retries = _defaults['connection_n_retries']


 def _setup():
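
The new `avoid_duplicate_runs` line converts a string default into a boolean. As a small illustration, assuming `_defaults` is a dict whose values are strings (only the keys appear in this diff; the value below is made up, and `parse_flag` is a hypothetical helper), the conditional expression is equivalent to a plain equality test, and any string other than the exact text `'True'` yields `False`:

```python
# Hypothetical defaults; the real values live in openml/config.py.
_defaults = {"avoid_duplicate_runs": "True"}

# Form used in the diff.
verbose = True if _defaults["avoid_duplicate_runs"] == "True" else False

# Equivalent, shorter form: the comparison already yields a bool.
concise = _defaults["avoid_duplicate_runs"] == "True"


def parse_flag(value: str) -> bool:
    """Return True only for the exact string 'True' (case-sensitive)."""
    return value == "True"
```

A consequence of this exact-match rule is that spellings such as `'true'` or `'1'` are treated as `False`, which may be worth keeping in mind when editing a configuration file by hand.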

openml/datasets/functions.py

Lines changed: 2 additions & 1 deletion
@@ -511,8 +511,9 @@ def create_dataset(name, description, creator, contributor,
 specified, the index of the dataframe will be used as the
 ``row_id_attribute``. If the name of the index is ``None``, it will
 be discarded.
+
 .. versionadded: 0.8
-Inference of ``row_id_attribute`` from a dataframe.
+Inference of ``row_id_attribute`` from a dataframe.
 original_data_url : str, optional
 For derived data, the url to the original dataset.
 paper_url : str, optional
