@@ -69,33 +69,36 @@ Key concepts
 ~~~~~~~~~~~~

 OpenML contains several key concepts which it needs to make machine learning
-research shareable. A machine learning experiment consists of several runs,
-which describe the performance of an algorithm (called a flow in OpenML) on a
-task. Task is the combination of a dataset, a split and an evaluation metric. In
-this user guide we will go through listing and exploring existing tasks to
-actually running machine learning algorithms on them. In a further user guide
-we will examine how to search through datasets in order to curate a list of
-tasks.
+research shareable. A machine learning experiment consists of one or several
+**runs**, which describe the performance of an algorithm (called a **flow** in
+OpenML) with its hyperparameter settings (called a **setup**) on a **task**. A
+**task** is the combination of a **dataset**, a split and an evaluation
+metric. In this user guide we will go from listing and exploring existing
+**tasks** to actually running machine learning algorithms on them. In a further
+user guide we will examine how to search through **datasets** in order to
+curate a list of **tasks**.
7980
 ~~~~~~~~~~~~~~~~~~
 Working with tasks
 ~~~~~~~~~~~~~~~~~~

-Tasks are containers, defining how to split the dataset into a train and test
-set, whether to use several disjoint train and test splits (cross-validation)
-and whether this should be repeated several times. Also, the task defines a
-target metric for which a flow should be optimized. You can think of a task as
-an experimentation protocol, describing how to apply a machine learning model
-to a dataset in a way that it is comparable with the results of others (more
-on how to do that further down).
+You can think of a task as an experimentation protocol, describing how to
+apply a machine learning model to a dataset in a way that is comparable with
+the results of others (more on how to do that further down). Tasks are
+containers, defining which dataset to use, what kind of problem we are solving
+(regression, classification, clustering, etc.) and which column to predict.
+Furthermore, a task also describes how to split the dataset into a train and
+test set, whether to use several disjoint train and test splits
+(cross-validation) and whether this should be repeated several times. Finally,
+the task defines a target metric for which a flow should be optimized.

 Tasks are identified by IDs and can be accessed in two different ways:

 1. In a list providing basic information on all tasks available on OpenML.
    This function will not download the actual tasks, but will instead download
    metadata that can be used to filter the tasks and retrieve a set of IDs.
-   We can filter this list, for example, we can only list
-   *supervised classification* tasks or tasks having a special tag.
+   We can filter this list; for example, we can list only tasks having a
+   special tag, or only tasks of a specific task type such as
+   *supervised classification*.

 2. A single task by its ID. It contains all meta information, the target metric,
    the splits and an iterator which can be used to access the splits in a
@@ -132,29 +135,30 @@ to have better visualization and easier access:
            'NumberOfSymbolicFeatures', 'cost_matrix'],
           dtype='object')

-Now we can restrict the tasks to all tasks with the desired resampling strategy:
+We can filter the list of tasks to contain only datasets with more than
+500 samples but fewer than 1000 samples:
136140
 .. code:: python

-    >>> filtered_tasks = tasks.query('estimation_procedure == "10-fold Crossvalidation"')
+    >>> filtered_tasks = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
     >>> print(list(filtered_tasks.index))  # doctest: +SKIP
-    [2, 3, 4, 5, 6, 7, 8, 9, ..., 146606, 146607, 146690]
-    >>> print(len(filtered_tasks))  # doctest: +SKIP
-    1697
-
-Resampling strategies can be found on the `OpenML Website <http://www.openml.org/search?type=measure&q=estimation%20procedure>`_.
+    [2, 11, 15, 29, 37, 41, 49, 53, ..., 146597, 146600, 146605]
+    >>> print(len(filtered_tasks))
+    210

-We can further filter the list of tasks to only contain datasets with more than
-500 samples, but less than 1000 samples:
+Then, we can further restrict the tasks to those that all use the same
+resampling strategy:

 .. code:: python

-    >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
+    >>> filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
     >>> print(list(filtered_tasks.index))  # doctest: +SKIP
     [2, 11, 15, 29, 37, 41, 49, 53, ..., 146231, 146238, 146241]
-    >>> print(len(filtered_tasks))
+    >>> print(len(filtered_tasks))  # doctest: +SKIP
     107

+Resampling strategies can be found on the `OpenML Website <http://www.openml.org/search?type=measure&q=estimation%20procedure>`_.
+
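The two ``query`` calls above are plain pandas operations on the listing
dataframe, so the filtering logic can be tried without contacting the server.
Below is a self-contained sketch using a small, made-up stand-in for the task
list (the column names follow the listing shown earlier; the task IDs and
values are invented for illustration):

.. code:: python

    import pandas as pd

    # Hypothetical miniature stand-in for the task listing dataframe.
    tasks = pd.DataFrame(
        {'NumberOfInstances': [150, 768, 960, 2000],
         'estimation_procedure': ['10-fold Crossvalidation',
                                  '10-fold Crossvalidation',
                                  '33% Holdout set',
                                  '10-fold Crossvalidation']},
        index=[2, 11, 15, 29])

    # Same two-step filtering as above: first on dataset size, then on the
    # resampling strategy.
    filtered = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
    filtered = filtered.query('estimation_procedure == "10-fold Crossvalidation"')
    print(list(filtered.index))  # only task 11 passes both filters

The two conditions could equally be combined into a single ``query``
expression; splitting them simply mirrors the step-by-step narrowing shown
above.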
 Similar to listing tasks by task type, we can list tasks by tags:

 .. code:: python

 .. code:: python

-    >>> ids = [12, 14, 16, 18, 20, 22]
+    >>> ids = [2, 11, 15, 29, 37, 41, 49, 53]
     >>> tasks = openml.tasks.get_tasks(ids)
     >>> pprint(tasks[0])  # doctest: +SKIP

@@ -229,8 +233,9 @@ Creating runs

 In order to upload and share results of running a machine learning algorithm
 on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
-be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
-a task. We will focus on the simpler example of running a scikit-learn model.
+be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn-compatible
+model on a task. We will focus on the simpler example of running a
+scikit-learn model.

 Flows are descriptions of something runnable which performs the machine learning.
 A flow contains all information to set up the necessary machine learning
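As a sketch of how such a run might be created (this is an assumption-laden
illustration: it requires network access to the OpenML server, the argument
order of ``run_model_on_task`` has differed between openml-python versions,
and publishing additionally requires an API key):

.. code:: python

    import openml
    from sklearn import tree

    # Sketch only: fetching the task and running the model both contact
    # the OpenML server.
    task = openml.tasks.get_task(2)  # a task ID from the listings above
    clf = tree.DecisionTreeClassifier()
    run = openml.runs.run_model_on_task(clf, task)
    # run.publish()  # uploading requires openml.config.apikey to be set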