
Commit 4181c4a

include suggestions from @amueller

1 parent 311c861
2 files changed: 36 additions & 31 deletions

doc/usage.rst
Lines changed: 35 additions & 30 deletions
@@ -69,33 +69,36 @@ Key concepts
 ~~~~~~~~~~~~
 
 OpenML contains several key concepts which it needs to make machine learning
-research shareable. A machine learning experiment consists of several runs,
-which describe the performance of an algorithm (called a flow in OpenML) on a
-task. Task is the combination of a dataset, a split and an evaluation metric. In
-this user guide we will go through listing and exploring existing tasks to
-actually running machine learning algorithms on them. In a further user guide
-we will examine how to search through datasets in order to curate a list of
-tasks.
+research shareable. A machine learning experiment consists of one or several
+**runs**, which describe the performance of an algorithm (called a **flow** in
+OpenML), its hyperparameter settings (called a **setup**) on a **task**. A
+**task** is the combination of a **dataset**, a split and an evaluation
+metric. In this user guide we will go through listing and exploring existing
+**tasks** to actually running machine learning algorithms on them. In a further
+user guide we will examine how to search through **datasets** in order to curate
+a list of **tasks**.
 
 ~~~~~~~~~~~~~~~~~~
 Working with tasks
 ~~~~~~~~~~~~~~~~~~
 
-Tasks are containers, defining how to split the dataset into a train and test
-set, whether to use several disjoint train and test splits (cross-validation)
-and whether this should be repeated several times. Also, the task defines a
-target metric for which a flow should be optimized. You can think of a task as
-an experimentation protocol, describing how to apply a machine learning model
-to a dataset in a way that it is comparable with the results of others (more
-on how to do that further down).
+You can think of a task as an experimentation protocol, describing how to apply
+a machine learning model to a dataset in a way that it is comparable with the
+results of others (more on how to do that further down). Tasks are containers,
+defining which dataset to use, what kind of task we're solving (regression,
+classification, clustering, etc.) and which column to predict. Furthermore,
+a task also describes how to split the dataset into a train and test set, whether
+to use several disjoint train and test splits (cross-validation) and whether
+this should be repeated several times. Also, the task defines a target metric
+for which a flow should be optimized.
 
 Tasks are identified by IDs and can be accessed in two different ways:
 
 1. In a list providing basic information on all tasks available on OpenML.
    This function will not download the actual tasks, but will instead download
    meta data that can be used to filter the tasks and retrieve a set of IDs.
-   We can filter this list, for example, we can only list
-   *supervised classification* tasks or tasks having a special tag.
+   We can filter this list; for example, we can only list tasks having a special
+   tag or only tasks for a specific target such as *supervised classification*.
 
 2. A single task by its ID. It contains all meta information, the target metric,
    the splits and an iterator which can be used to access the splits in a
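The key-concepts paragraph added in this hunk describes how runs, flows, setups, tasks and datasets relate. A minimal sketch of those relationships as plain data classes may help; note these toy types are purely illustrative and are not the package's real classes (openml-python exposes `OpenMLTask`, `OpenMLFlow`, `OpenMLRun`, and so on):

```python
from dataclasses import dataclass

# Illustrative model only -- not openml-python's actual class hierarchy.

@dataclass
class Dataset:
    name: str

@dataclass
class Task:
    # A task combines a dataset, a split strategy and an evaluation metric.
    dataset: Dataset
    split: str
    metric: str

@dataclass
class Flow:
    # A flow describes an algorithm/implementation.
    name: str

@dataclass
class Run:
    # A run records the performance of a flow, with its hyperparameter
    # settings (the "setup"), on a task.
    flow: Flow
    setup: dict
    task: Task
    score: float

iris = Dataset("iris")
task = Task(iris, split="10-fold Crossvalidation", metric="predictive_accuracy")
run = Run(Flow("sklearn.tree.DecisionTreeClassifier"), {"max_depth": 3}, task, 0.94)
print(run.task.dataset.name)  # prints "iris"
```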
@@ -132,29 +135,30 @@ to have better visualization and easier access:
            'NumberOfSymbolicFeatures', 'cost_matrix'],
           dtype='object')
 
-Now we can restrict the tasks to all tasks with the desired resampling strategy:
+We can filter the list of tasks to only contain datasets with more than
+500 samples, but less than 1000 samples:
 
 .. code:: python
 
-    >>> filtered_tasks = tasks.query('estimation_procedure == "10-fold Crossvalidation"')
+    >>> filtered_tasks = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
     >>> print(list(filtered_tasks.index)) # doctest: +SKIP
-    [2, 3, 4, 5, 6, 7, 8, 9, ..., 146606, 146607, 146690]
-    >>> print(len(filtered_tasks)) # doctest: +SKIP
-    1697
-
-Resampling strategies can be found on the `OpenML Website <http://www.openml.org/search?type=measure&q=estimation%20procedure>`_.
+    [2, 11, 15, 29, 37, 41, 49, 53, ..., 146597, 146600, 146605]
+    >>> print(len(filtered_tasks))
+    210
 
-We can further filter the list of tasks to only contain datasets with more than
-500 samples, but less than 1000 samples:
+Then, we can further restrict the tasks to all have the same resampling
+strategy:
 
 .. code:: python
 
-    >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
+    >>> filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
     >>> print(list(filtered_tasks.index)) # doctest: +SKIP
     [2, 11, 15, 29, 37, 41, 49, 53, ..., 146231, 146238, 146241]
-    >>> print(len(filtered_tasks))
+    >>> print(len(filtered_tasks)) # doctest: +SKIP
     107
 
+Resampling strategies can be found on the `OpenML Website <http://www.openml.org/search?type=measure&q=estimation%20procedure>`_.
+
 Similar to listing tasks by task type, we can list tasks by tags:
 
 .. code:: python
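The hunk above reorders two chained `pandas.DataFrame.query` calls. Their composition can be tried offline with a toy metadata frame; the column names below match the real task listing, but the rows and index values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the DataFrame built from openml.tasks.list_tasks();
# the values are made up for this sketch.
tasks = pd.DataFrame({
    "NumberOfInstances": [150, 700, 900, 5000],
    "estimation_procedure": [
        "10-fold Crossvalidation",
        "10-fold Crossvalidation",
        "33% Holdout set",
        "10-fold Crossvalidation",
    ],
}, index=[2, 11, 15, 29])

# First restrict by dataset size, then by resampling strategy,
# mirroring the order used in the documentation above.
filtered = tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")
filtered = filtered.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered.index))  # [11]
```

The order of the two `query` calls does not change the final result; the documentation applies the cheaper size filter first simply for narrative flow.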
@@ -219,7 +223,7 @@ And:
 
 .. code:: python
 
-    >>> ids = [12, 14, 16, 18, 20, 22]
+    >>> ids = [2, 11, 15, 29, 37, 41, 49, 53]
     >>> tasks = openml.tasks.get_tasks(ids)
     >>> pprint(tasks[0]) # doctest: +SKIP
@@ -229,8 +233,9 @@ Creating runs
 
 In order to upload and share results of running a machine learning algorithm
 on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
-be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
-a task. We will focus on the simpler example of running a scikit-learn model.
+be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn compatible
+model on a task. We will focus on the simpler example of running a
+scikit-learn model.
 
 Flows are descriptions of something runable which does the machine learning.
 A flow contains all information to set up the necessary machine learning

openml/datasets/dataset.py
Lines changed: 1 addition & 1 deletion
@@ -242,7 +242,7 @@ def get_data(self, target=None, target_dtype=None, include_row_id=False,
         else:
             if isinstance(target, six.string_types):
                 target = [target]
-            legal_target_types = (int, float)
+            legal_target_types = (int, float, np.float32, np.float64)
             if target_dtype not in legal_target_types:
                 raise ValueError(
                     "%s is not a legal target type. Legal target types are %s" %
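The one-line change above widens the set of accepted `target_dtype` values to include the NumPy float types. A standalone sketch of that membership check (the function name here is hypothetical, not the actual method in `dataset.py`):

```python
import numpy as np

def check_target_dtype(target_dtype):
    # Mirrors the validation in dataset.py: after this commit, NumPy
    # float types are accepted alongside the Python builtins.
    legal_target_types = (int, float, np.float32, np.float64)
    if target_dtype not in legal_target_types:
        raise ValueError(
            "%s is not a legal target type. Legal target types are %s"
            % (target_dtype, legal_target_types))
    return target_dtype

check_target_dtype(np.float32)  # accepted only after this change
check_target_dtype(float)       # accepted before and after
```

Before the change, passing `target_dtype=np.float32` would raise the `ValueError` even though it is a perfectly reasonable dtype for a regression target, since the tuple membership test compares type objects directly rather than using something like `np.issubdtype`.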
