Commit 8ee635b

Merge pull request #137 from rhiever/develop
Clean up usage docs
2 parents a2d0bc6 + f2adf46

1 file changed: doc/usage.rst
28 additions & 28 deletions
@@ -13,18 +13,18 @@ Basic Usage
 ***********
 
 This document will guide you through the most important functions and classes
-in the OpenML python API. Throughout the document, we will use
+in the OpenML Python API. Throughout this document, we will use
 `pandas <http://pandas.pydata.org/>`_ to format and filter tables.
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Connecting to the OpenML server
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The OpenML server can only be accessed by users who have signed up to the OpenML
+The OpenML server can only be accessed by users who have signed up on the OpenML
 platform. If you don't have an account yet,
-`sign up now <http://openml.org/register>`_. You will receive an API key which
+`sign up now <http://openml.org/register>`_. You will receive an API key, which
 will authenticate you to the server and allow you to download and upload
-datasets, tasks, runs and flows. There are two ways of telling the API key
+datasets, tasks, runs and flows. There are two ways of providing the API key
 to the OpenML API package. The first option is to specify the API key
 programmatically after loading the package:
 
@@ -35,17 +35,17 @@ programmatically after loading the package:
     >>> apikey = 'Your API key'
     >>> openml.apikey = apikey
 
-The second option is to create a config like this:
+The second option is to create a config file:
 
 .. code:: bash
 
     apikey = qxlfpbeaudtprb23985hcqlfoebairtd
 
-The config file has to be in the directory :bash:`~/.openml/config` and must
+The config file must be in the directory :bash:`~/.openml/config` and
 exist prior to importing the openml module.
 
 When downloading datasets, tasks, runs and flows, they will be cached to
-retrieve them without calling the server later. As with the api key, the cache
+retrieve them without calling the server later. As with the API key, the cache
 directory can be either specified through the API or through the config file:
 
 API:
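As an aside on the config file shown in this hunk: it is a plain ``key = value`` text file. The sketch below is only an illustration of how such a file could be parsed with the standard library; the helper name is made up and is not part of the openml package, which ships its own config handling.

```python
def parse_openml_config(text):
    """Parse simple 'key = value' lines, e.g. the contents of ~/.openml/config.

    Hypothetical helper for illustration only.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


example = "apikey = qxlfpbeaudtprb23985hcqlfoebairtd"
print(parse_openml_config(example)["apikey"])  # -> qxlfpbeaudtprb23985hcqlfoebairtd
```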
@@ -69,17 +69,17 @@ Datasets are a key concept in OpenML (see `OpenML documentation <openml.org/guid
 Datasets are identified by IDs and can be accessed in two different ways:
 
 1. In a list providing basic information on all datasets available on OpenML.
-   This function will not download the actual dataset, but only very little
+   This function will not download the actual dataset, but will instead download
    meta data which can be used to filter the datasets and retrieve a set of IDs.
-2. A single dataset by its ID. It contains all meta information and the actual
+2. A single dataset by its ID. A single dataset contains all meta information and the actual
    data in form of an .arff file. The .arff file will be converted into a numpy
-   array by the OpenML python API.
+   array by the OpenML Python API.
 
 Listing datasets
 ~~~~~~~~~~~~~~~~
 
 A common task when using OpenML is to find a set of datasets which fulfill
-several criteria. They should for example have between 1.000 and 10.000
+several criteria. They should for example have between 1,000 and 10,000
 data points and at least five features.
 
 .. code:: python
@@ -137,7 +137,7 @@ and can see the first data point:
 We can now filter the data:
 
 >>> filter = (datasets.NumberOfInstances > 1000) & (datasets.NumberOfFeatures > 5)
->>> filtered_datasets = datasets[filter]
+>>> filtered_datasets = datasets.loc[filter]
 >>> dataset_indices = list(filtered_datasets.index)
 >>> print(dataset_indices) # doctest: +SKIP
 [3, 6, 12, 14, 16, 18, 20, 21, 22, 23, 24, 26, 28, 30, 32, 36, 38, 44,
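The pandas boolean mask in this hunk keeps datasets with more than 1000 instances and more than 5 features. For readers without pandas at hand, the same filter can be sketched in plain Python; the metadata rows below are made up for illustration, not real OpenML listings.

```python
# Hypothetical metadata rows; real listings come from the OpenML server.
datasets = [
    {"did": 3, "NumberOfInstances": 3196, "NumberOfFeatures": 37},
    {"did": 6, "NumberOfInstances": 20000, "NumberOfFeatures": 17},
    {"did": 61, "NumberOfInstances": 150, "NumberOfFeatures": 5},
]

# Same criteria as the pandas boolean mask in the diff above.
dataset_indices = [
    d["did"] for d in datasets
    if d["NumberOfInstances"] > 1000 and d["NumberOfFeatures"] > 5
]
print(dataset_indices)  # -> [3, 6]
```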
@@ -164,11 +164,11 @@ Properties of the dataset are stored as member variables:
 >>> print(dataset.__dict__) # doctest: +SKIP
{'upload_date': u'2014-04-06 23:21:03', 'md5_cheksum': u'3149646ecff276abac3e892d1556655f', 'creator': None, 'citation': None, 'tag': [u'study_1', u'study_7', u'uci'], 'version_label': u'1', 'contributor': None, 'paper_url': None, 'original_data_url': None, 'id': 23, 'collection_date': None, 'row_id_attribute': None, 'version': 1, 'data_pickle_file': '/home/matthias/.openml/cache/datasets/23/dataset.pkl', 'default_target_attribute': u'Contraceptive_method_used', 'description': u"**Author**: \n**Source**: Unknown - \n**Please cite**: \n\n1. Title: Contraceptive Method Choice\n \n 2. Sources:\n (a) Origin: This dataset is a subset of the 1987 National Indonesia\n Contraceptive Prevalence Survey\n (b) Creator: Tjen-Sien Lim (limt@stat.wisc.edu)\n (c) Donor: Tjen-Sien Lim (limt@stat.wisc.edu)\n (c) Date: June 7, 1997\n \n 3. Past Usage:\n Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (1999). A Comparison of\n Prediction Accuracy, Complexity, and Training Time of Thirty-three\n Old and New Classification Algorithms. Machine Learning. Forthcoming.\n (ftp://ftp.stat.wisc.edu/pub/loh/treeprogs/quest1.7/mach1317.pdf or\n (http://www.stat.wisc.edu/~limt/mach1317.pdf)\n \n 4. Relevant Information:\n This dataset is a subset of the 1987 National Indonesia Contraceptive\n Prevalence Survey. The samples are married women who were either not \n pregnant or do not know if they were at the time of interview. The \n problem is to predict the current contraceptive method choice \n (no use, long-term methods, or short-term methods) of a woman based \n on her demographic and socio-economic characteristics.\n \n 5. Number of Instances: 1473\n \n 6. Number of Attributes: 10 (including the class attribute)\n \n 7. Attribute Information:\n \n 1. Wife's age (numerical)\n 2. Wife's education (categorical) 1=low, 2, 3, 4=high\n 3. Husband's education (categorical) 1=low, 2, 3, 4=high\n 4. Number of children ever born (numerical)\n 5. Wife's religion (binary) 0=Non-Islam, 1=Islam\n 6. 
Wife's now working? (binary) 0=Yes, 1=No\n 7. Husband's occupation (categorical) 1, 2, 3, 4\n 8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high\n 9. Media exposure (binary) 0=Good, 1=Not good\n 10. Contraceptive method used (class attribute) 1=No-use \n 2=Long-term\n 3=Short-term\n \n 8. Missing Attribute Values: None\n\n Information about the dataset\n CLASSTYPE: nominal\n CLASSINDEX: last", 'format': u'ARFF', 'visibility': u'public', 'update_comment': None, 'licence': u'Public', 'name': u'cmc', 'language': None, 'url': u'http://www.openml.org/data/download/23/dataset_23_cmc.arff', 'data_file': '~/.openml/cache/datasets/23/dataset.arff', 'ignore_attributes': None}
 
-Then, to obtain the data matrix:
+Next, to obtain the data matrix:
 
 .. code:: python
 
-    >>> X = dataset.get_dataset()
+    >>> X = dataset.get_data()
     >>> print(X.shape, X.dtype)
     ((1473, 10), dtype('float32'))
 
@@ -178,21 +178,21 @@ variables are encoded as integers, the inverse encoding can be retrieved via:
 
 .. code:: python
 
-    >>> X, names = dataset.get_dataset(return_attribute_names=True)
+    >>> X, names = dataset.get_data(return_attribute_names=True)
     >>> print(names)
     [u'Wifes_age', u'Wifes_education', u'Husbands_education', u'Number_of_children_ever_born', u'Wifes_religion', u'Wifes_now_working%3F', u'Husbands_occupation', u'Standard-of-living_index', u'Media_exposure', u'Contraceptive_method_used']
 
-Most times, having a single data matrix :python:`X` is not enough. Two very
+Most times, having a single data matrix :python:`X` is not enough. Two
 useful arguments are :python:`target` and
 :python:`return_categorical_indicator`. :python:`target` makes
-:meth:`openml.datasets.get_dataset()` return :python:`X` and :python:`y`
+:meth:`get_data()` return :python:`X` and :python:`y`
 seperate; :python:`return_categorical_indicator` makes
-:meth:`openml.datasets.get_dataset()` return a boolean array which indicate
+:meth:`get_data()` return a boolean array which indicate
 which attributes are categorical (and should be one hot encoded if necessary.)
 
 .. code:: python
 
-    >>> X, y, categorical = dataset.get_dataset(
+    >>> X, y, categorical = dataset.get_data(
     ... target=dataset.default_target_attribute,
     ... return_categorical_indicator=True)
     >>> print(X.shape, y.shape)
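This hunk notes that categorical attributes "should be one hot encoded if necessary". As a hedged illustration of that step, independent of the openml API (the helper function and category table below are hypothetical):

```python
def one_hot_row(row, categorical, categories_per_column):
    """One-hot encode the categorical entries of a single data row.

    `categorical` is a per-column boolean indicator, analogous to the array
    obtained via return_categorical_indicator; `categories_per_column` maps
    each categorical column index to its sorted category values (hypothetical).
    """
    encoded = []
    for j, value in enumerate(row):
        if categorical[j]:
            # expand one categorical value into a 0/1 vector over its categories
            cats = categories_per_column[j]
            encoded.extend(1.0 if value == c else 0.0 for c in cats)
        else:
            encoded.append(float(value))
    return encoded


row = [24.0, 2.0]             # wife's age (numeric), education level (categorical)
categorical = [False, True]
cats = {1: [1.0, 2.0, 3.0, 4.0]}
print(one_hot_row(row, categorical, cats))  # -> [24.0, 0.0, 1.0, 0.0, 0.0]
```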
@@ -222,7 +222,7 @@ In case you are working with `scikit-learn
     warm_start=False)
 
 When you have to retrieve several datasets, you can use the convenience function
-:meth:`openml.datasets.get_datasets()` which downloads all datasets given by
+:meth:`openml.datasets.get_datasets()`, which downloads all datasets given by
 a list of IDs:
 
 >>> ids = [12, 14, 16, 18, 20, 22]
@@ -248,8 +248,8 @@ Just like datasets, tasks are identified by IDs and can be accessed in three
 different ways:
 
 1. In a list providing basic information on all tasks available on OpenML.
-   This function will not download the actual tasks, but only very little
-   meta data which can be used to filter the tasks and retrieve a set of IDs.
+   This function will not download the actual tasks, but will instead download
+   meta data that can be used to filter the tasks and retrieve a set of IDs.
 2. By functions only list a subset of all available tasks, restricted either by
    their :TODO:`task_type`, :TODO:`tag` or :TODO:`check_for_more`.
 3. A single task by its ID. It contains all meta information, the target metric,
@@ -261,19 +261,19 @@ You can also read more about tasks in the `OpenML guide <http://www.openml.org/g
 Listing tasks
 ~~~~~~~~~~~~~
 
-Once we figured out the datasets we want to work on, we have to download the
+Once we decide on the datasets we want to work on, we have to download the
 corresponding tasks. Tasks can be pre-filtered by either by a task type or
 a tag.
 
-So far, this package only supports the tasks supervised classification (task
-type :python:`1`) and supervised regression (task type :python:`2`) #TODO check this
+So far, this package only supports supervised classification tasks (task
+type :python:`1`) and supervised regression tasks (task type :python:`2`) #TODO check this
 We desribe how to find other task types in the subsection `Finding out task types`_
-and are very happy about contributions which help us to support all other task
+and are happy to receive contributions that help us to support all other task
 types.
 
 The most natural way to retrieve tasks is by their task type. In this example
-we will use the most commonly studied machine learning task supervised
-classification (task type :python:`1`):
+we will use the most commonly studied machine learning supervised
+classification task (task type :python:`1`):
 
 .. code:: python
 