Commit 94102f3

glemaitre authored and mfeurer committed
[MRG] EHN: Add support for pandas DataFrame and SparseDataFrame when loading (#548)
* EHN: add support for DataFrame when loading dataset
* MAINT: add pandas as dependency
* FIX: typo in setup
* TST: add unit test for checking pandas and numpy
* FIX: back-compatibility defaulting on float 32
* PEP8
* FIX: transform y to integer if a category for back-compat
* PEP8
* DOC: add example
* TST: remove useless tests
* iter
* iter
* iter
* EHN: partially address mfeurer comments
* FIX: append column and concat
* simplify
* FIX: add back missing test files
* CLEAN: remove new useless pkl
* FIX: revert backward compatibility
* PEP8
* PEP8
* fix
* TST: ensure behavior of ignore_attribute
* TST: add test for SparseDataFrame
* raise FutureWarning and avoid warning in testing
* EHN: interpret propely the boolean type
* FIX typo
* PEP8
* MAINT: show slowest tests
* FIX: avoid reallocation in a loop with pandas
* fix typo
* fixes
1 parent 9c74931 commit 94102f3

7 files changed

Lines changed: 433 additions & 111 deletions

ci_scripts/test.sh

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ run_tests() {
         PYTEST_ARGS=''
     fi
 
-    pytest -n 4 --timeout=600 --timeout-method=thread -sv --ignore='test_OpenMLDemo.py' $PYTEST_ARGS $test_dir
+    pytest -n 4 --duration=20 --timeout=600 --timeout-method=thread -sv --ignore='test_OpenMLDemo.py' $PYTEST_ARGS $test_dir
 }
 
 if [[ "$RUN_FLAKE8" == "true" ]]; then

examples/datasets_tutorial.py

Lines changed: 14 additions & 2 deletions
@@ -55,16 +55,28 @@
 ############################################################################
 # Get the actual data.
 #
-# Returned as numpy array, with meta-info
-# (e.g. target feature, feature names, ...)
+# The dataset can be returned in 2 possible formats: as a NumPy array, a SciPy
+# sparse matrix, or as a Pandas DataFrame (or SparseDataFrame). The format is
+# controlled with the parameter ``dataset_format`` which can be either 'array'
+# (default) or 'dataframe'. Let's first build our dataset from a NumPy array
+# and manually create a dataframe.
 X, y, attribute_names = dataset.get_data(
+    dataset_format='array',
     target=dataset.default_target_attribute,
     return_attribute_names=True,
 )
 eeg = pd.DataFrame(X, columns=attribute_names)
 eeg['class'] = y
 print(eeg[:10])
 
+############################################################################
+# Instead of manually creating the dataframe, you can already request a
+# dataframe with the correct dtypes.
+X, y = dataset.get_data(target=dataset.default_target_attribute,
+                        dataset_format='dataframe')
+print(X.head())
+print(X.info())
+
 ############################################################################
 # Exercise 2
 # **********
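
The two approaches in the hunk above can be compared without downloading anything. The sketch below uses a small made-up table: the column names, values, and category labels are hypothetical and are not taken from the EEG dataset or from the openml API. It illustrates why a DataFrame built by hand from a NumPy array gives every column the array's dtype, while a frame carrying explicit dtypes can keep a nominal column as a pandas categorical, which is roughly what the tutorial means by "the correct dtypes".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an 'array' result: a float matrix, a target
# vector of integer codes, and a list of column names.
X = np.array([[4329.2, 0.0], [4324.6, 1.0], [4327.7, 0.0]], dtype=np.float32)
y = np.array([0, 1, 0])
attribute_names = ["V1", "sensor_state"]

# Manual construction, as in the tutorial: every feature column is float32.
manual = pd.DataFrame(X, columns=attribute_names)
manual["class"] = y

# A typed frame: the nominal column is stored as a pandas categorical.
# The labels "off"/"on" are invented for this illustration.
typed = manual.copy()
codes = manual["sensor_state"].astype(int).to_numpy()
typed["sensor_state"] = pd.Categorical.from_codes(codes, categories=["off", "on"])

print(manual.dtypes)
print(typed.dtypes)
```

In the manual frame the nominal column is indistinguishable from a numeric one, so any downstream code has to be told separately which columns are categorical.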

examples/flows_and_runs_tutorial.py

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,7 @@
 
 dataset = openml.datasets.get_dataset(68)
 X, y = dataset.get_data(
+    dataset_format='array',
     target=dataset.default_target_attribute
 )
 clf = neighbors.KNeighborsClassifier(n_neighbors=1)
@@ -28,6 +29,7 @@
 # * e.g. categorical features -> do feature encoding
 dataset = openml.datasets.get_dataset(17)
 X, y, categorical = dataset.get_data(
+    dataset_format='array',
     target=dataset.default_target_attribute,
     return_categorical_indicator=True,
 )
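
The comment in the second hunk says categorical features need encoding, and the call asks for a `categorical` indicator alongside the data. Below is a minimal sketch of how such a boolean mask could drive one-hot encoding, assuming the mask holds one flag per column of `X`. The array values and column names are invented, and `pd.get_dummies` is just one possible encoder, not necessarily the one the tutorial goes on to use.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for the values returned in 'array' format together
# with a categorical indicator.
X = np.array([[1.5, 0.0, 2.0],
              [0.3, 1.0, 0.0],
              [2.2, 0.0, 1.0]])
categorical = [False, True, True]  # assumed: one flag per column of X

# Hypothetical column names f0, f1, f2.
frame = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
cat_cols = [col for col, is_cat in zip(frame.columns, categorical) if is_cat]

# Cast the integer-coded categories to int so dummy columns are named
# after whole numbers rather than floats.
frame = frame.astype({col: int for col in cat_cols})

# One-hot encode only the flagged columns; numeric columns pass through.
encoded = pd.get_dummies(frame, columns=cat_cols)

print(encoded.columns.tolist())
print(encoded.shape)
```

Keeping the mask separate from the data is what makes the 'array' format usable with encoders that take column indices or names.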
