Commit c89cbb3

add docs for contributing datasets
- fix #46
- adapted from https://github.com/MDAnalysis/MDAnalysisData/wiki/add-new-data-set
- update CHANGES
1 parent 56f9ae0 commit c89cbb3

4 files changed

Lines changed: 203 additions & 6 deletions

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
66

77
## [0.8.1] - YYYY-MM-DD
88

9+
### Added
10+
- docs for how to contribute a new dataset (#46)
11+
912
### Changes
1013
- update online docs theme (#43)
1114

docs/contributing.rst

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
1+
.. -*- coding: utf-8 -*-
2+
.. _contributing:
3+
4+
===========================
5+
Contributing new datasets
6+
===========================
7+
8+
New datasets are very welcome and everybody is encouraged to make their
9+
datasets accessible via :mod:`MDAnalysisData`, regardless of the simulation
10+
package or analysis code that they use. Users are encouraged to cite the
11+
authors of the datasets.
12+
13+
:mod:`MDAnalysisData` does *not* store files and trajectories. Instead, it
14+
provides accessor code to seamlessly download (and cache) files from archives.
15+
16+
17+
Outline
18+
=======
19+
20+
When you contribute data, you have to do two things:
21+
22+
1. **deposit data in an archive** under an `Open Data`_ compatible license
23+
(`CC0`_ or `CC-BY`_ preferred)
24+
2. **write accessor code** in :mod:`MDAnalysisData`
25+
26+
The accessor code needs the stable archive URL(s) for your files and SHA256
27+
checksums to check the integrity of any downloaded files. You will also add
28+
a description of your dataset.
29+
30+
31+
.. note::
32+
33+
We currently have code to work with the `figshare`_ archive so choosing
34+
*figshare* will be easiest. But it should be straightforward to add code to
35+
work with other archive-grade repositories such as `zenodo`_ or
36+
`DataDryad`_. Some universities also provide digital repositories that are
37+
suitable. Open an issue in the `Issue Tracker`_ for supporting other
38+
archives.
39+
40+
41+
Step-by-step instructions
42+
=========================
43+
44+
To add a new dataset, deposit your data in a repository. Then open a *pull
45+
request* for the https://github.com/MDAnalysis/MDAnalysisData
46+
repository. Follow these steps:
47+
48+
STEP 1: Archival deposition
49+
---------------------------
50+
51+
Deposit *all* required files in an archive-grade repository such as
52+
`figshare`_.
53+
54+
.. Note::
55+
56+
The site must *provide stable download links* and *must not change the
57+
content during download* because we store a SHA256 :ref:`checksum<checksum>`
58+
to check file integrity.
59+
60+
Make sure to **choose an** `Open Data`_ **compatible license** such as CC0_ or
61+
`CC-BY`_.
62+
63+
Take note of the **direct download URL** for each of your files. It should be
64+
possible to obtain the file directly from a stable URL with :program:`curl` or
65+
:program:`wget`. As an example look at the dataset for
66+
:mod:`MDAnalysisData.adk_equilibrium` at DOI `10.6084/m9.figshare.5108170`_ (as
67+
shown in the :ref:`figure below<fig-figshare-adk>`). Especially note the
68+
*download* links of the DCD trajectory
69+
(https://ndownloader.figshare.com/files/8672074) and PSF topology files
70+
(https://ndownloader.figshare.com/files/8672230) as these links will be needed
71+
in the accessor code in :mod:`MDAnalysisData` in the next step.
72+
73+
.. _fig-figshare-adk:
74+
75+
.. figure:: images/figshare_adk_equilibrium.png
76+
77+
The AdK Equilibrium dataset on figshare (DOI `10.6084/m9.figshare.5108170`_),
78+
highlighting the deposited trajectory and topology files. The *download*
79+
URLs are visible when hovering over a file's image.
80+
81+
82+
.. _`10.6084/m9.figshare.5108170`:
83+
https://doi.org/10.6084/m9.figshare.5108170
84+
85+
86+
87+
STEP 2: Add code and docs to MDAnalysisData
88+
-------------------------------------------
89+
90+
91+
1. Add a Python module ``{MODULE_NAME}.py`` with the name of your dataset
92+
(where ``{MODULE_NAME}`` is just a placeholder). As an example see
93+
`MDAnalysisData/adk_equilibrium.py`_ (which becomes
94+
:mod:`MDAnalysisData.adk_equilibrium`). In many cases you can copy an
95+
existing module and adapt:
96+
97+
- text: describe your dataset
98+
- :data:`NAME`: name of the data set; it is used as a file name, so avoid spaces and other special characters
99+
- :data:`DESCRIPTION`: filename of the description file (which contains
100+
reStructuredText, so it needs the suffix ``.rst``)
101+
- :data:`ARCHIVE`: dictionary containing
102+
:class:`~MDAnalysisData.base.RemoteFileMetadata` instances. Keys should
103+
describe the file type. Typically
104+
105+
- *topology*: topology file (PSF, TPR, ...)
106+
- *trajectory*: trajectory coordinate file (DCD, XTC, ...)
107+
- *structure* (optional): system with single frame of coordinates
108+
(typically PDB, GRO, CRD, ...)
109+
110+
- name of the :func:`fetch_{NAME}` function (where ``{NAME}`` is a suitable
111+
name to access your dataset)
112+
- docs of the :func:`fetch_{NAME}` function
113+
- calculate and store the reference :ref:`SHA256 checksum <checksum>` as
114+
described below
115+
116+
2. Add a description file (example:
117+
`MDAnalysisData/descr/adk_equilibrium.rst`_); copy an existing file and
118+
adapt. **Make sure to add license information.**
119+
3. Import your :func:`fetch_{NAME}` function in
120+
`MDAnalysisData/datasets.py`_. ::
121+
122+
from .{MODULE_NAME} import fetch_{NAME}
123+
124+
4. Add documentation ``{NAME}.rst`` in reStructuredText format under `docs/`_
125+
(take existing files as examples) and append ``{NAME}`` to the second
126+
``toctree`` section of the `docs/index.rst`_ file.
127+
128+
.. code-block:: reST
129+
130+
.. toctree::
131+
:maxdepth: 1
132+
:caption: Datasets
133+
:hidden:
134+
135+
adk_equilibrium
136+
adk_transitions
137+
...
138+
CG_fiber
139+
{NAME}
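
The module-level pieces listed in step 1 can be sketched as follows. This is only an illustration, not the content of an existing module: the dataset name, file names, URLs and checksum strings are hypothetical placeholders, and ``RemoteFileMetadata`` is re-declared here as a stand-in (with assumed field names ``filename``, ``url`` and ``checksum``) so that the sketch runs on its own. In a real module you would import the class from :mod:`MDAnalysisData.base` instead.

```python
from collections import namedtuple

# Stand-in for MDAnalysisData.base.RemoteFileMetadata so that this sketch is
# self-contained; the field names are an assumption.
RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum"]
)

# Hypothetical dataset name; it is used as a file name, so no spaces.
NAME = "my_dataset"
# Hypothetical description file; it must carry the .rst suffix.
DESCRIPTION = "my_dataset.rst"

# One entry per deposited file; the keys describe the file type.
# URLs and checksums below are placeholders, not real values.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="my_system.psf",
        url="https://example.org/files/TOPOLOGY_FILE_ID",
        checksum="<SHA256 hex digest of my_system.psf>",
    ),
    "trajectory": RemoteFileMetadata(
        filename="my_trajectory.dcd",
        url="https://example.org/files/TRAJECTORY_FILE_ID",
        checksum="<SHA256 hex digest of my_trajectory.dcd>",
    ),
}
```

Replace every placeholder with the direct download URL and the reference checksum of the corresponding deposited file.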
140+
141+
If your data set does not follow the same pattern as the example above (where
142+
each file is downloaded separately) then you have to write your own
143+
:func:`fetch_{NAME}` function. E.g., you might download a tar file and then
144+
unpack the file yourself. Use scikit-learn's `sklearn/datasets`_ as examples,
145+
make sure that your function sets appropriate attributes in the returned
146+
:class:`~MDAnalysisData.base.Bunch` of records, and fully document what is
147+
returned.
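
For the tar file case, the unpacking part of a custom fetch function might look like the sketch below. It uses only the Python standard library and leaves out the download and checksum verification. The function name ``unpack_archive`` is hypothetical, and the returned dictionary merely stands in for the attributes you would set on the returned :class:`~MDAnalysisData.base.Bunch`.

```python
import os
import tarfile


def unpack_archive(archive_path, data_dir):
    """Unpack a tar file into ``data_dir``.

    Returns a dict that maps each extracted file's base name to its path.
    (Hypothetical helper, not part of MDAnalysisData.)
    """
    os.makedirs(data_dir, exist_ok=True)
    root = os.path.realpath(data_dir)
    with tarfile.open(archive_path) as tar:
        # Only regular files are extracted; links and devices are skipped.
        members = [m for m in tar.getmembers() if m.isfile()]
        for member in members:
            # Refuse members that would be written outside data_dir.
            target = os.path.realpath(os.path.join(data_dir, member.name))
            if os.path.commonpath([root, target]) != root:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(data_dir, members=members)
    return {
        os.path.basename(m.name): os.path.join(data_dir, m.name)
        for m in members
    }
```

A fetch function built on this would first download the tar file and verify its checksum, then call the helper and copy the resulting paths into the returned record.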
148+
149+
150+
.. _checksum:
151+
152+
RemoteFileMetadata and SHA256 checksum
153+
======================================
154+
155+
The :class:`~MDAnalysisData.base.RemoteFileMetadata` is used by
156+
:func:`~MDAnalysisData.base._fetch_remote`, which checks file integrity by
157+
computing a SHA256 checksum over each downloaded file and comparing it with a stored reference
158+
checksum. **You must compute the reference checksum and store it in your**
159+
:class:`~MDAnalysisData.base.RemoteFileMetadata` data structure for each file.
160+
161+
Typically you will have a local copy of the files during testing. You can
162+
compute the SHA256 for a file ``FILENAME`` with the following code::
163+
164+
import MDAnalysisData.base
165+
print(MDAnalysisData.base._sha256(FILENAME))
166+
167+
or from the command line:
168+
169+
.. code-block:: bash
170+
171+
python -c 'import MDAnalysisData.base; print(MDAnalysisData.base._sha256("FILENAME"))'
172+
173+
where ``FILENAME`` is the file that is stored in the archive.
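
If :mod:`MDAnalysisData` is not installed where the files live, the same kind of digest can be computed with the standard library alone. The sketch below assumes that the reference checksum is the plain hexadecimal SHA256 digest of the file contents; ``sha256_of_file`` is a hypothetical helper name, not part of the package.

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the hexadecimal SHA256 digest of the file at ``path``."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so that large trajectories do not
        # have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Call it as ``sha256_of_file("FILENAME")`` and paste the resulting string into the ``checksum`` field of the corresponding entry.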
174+
175+
176+
.. references
177+
178+
.. _`Open Data`: https://opendatacommons.org/
179+
.. _CC0: https://creativecommons.org/share-your-work/public-domain/cc0
180+
.. _CC-BY: https://creativecommons.org/licenses/by/4.0/
181+
.. _figshare: https://figshare.com/
182+
.. _zenodo: https://zenodo.org/
183+
.. _DataDryad: https://www.datadryad.org/
184+
.. _`Issue Tracker`: https://github.com/MDAnalysis/MDAnalysisData/issues
185+
.. _`MDAnalysisData/adk_equilibrium.py`:
186+
https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/adk_equilibrium.py
187+
.. _`MDAnalysisData/descr/adk_equilibrium.rst`:
188+
https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/descr/adk_equilibrium.rst
189+
.. _`MDAnalysisData/datasets.py`:
190+
https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/datasets.py
191+
.. _`docs/`:
192+
https://github.com/MDAnalysis/MDAnalysisData/blob/master/docs/
193+
.. _`docs/index.rst`:
194+
https://github.com/MDAnalysis/MDAnalysisData/blob/master/docs/index.rst
195+
.. _`sklearn/datasets`:
196+
https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/datasets
(binary image file, 80.6 KB; not shown)

docs/index.rst

Lines changed: 4 additions & 6 deletions
@@ -33,13 +33,12 @@ can be found in the public GitHub repository `mdanalysis/MDAnalysisData`_.
3333

3434
This library is *under active development*. We use `semantic
3535
versioning`_ to indicate clearly what kind of changes you may expect
36-
between releases. Please raise any issues or questions in the
37-
`Issue Tracker`_. `Contributions of data sets`_ and code in the form
38-
of pull requests are very welcome.
36+
between releases. Please raise any issues or questions in the `Issue
37+
Tracker`_. :ref:`Contributions of data sets <contributing>` and code
38+
in the form of pull requests are very welcome.
3939

4040
.. |zenodo| image:: https://zenodo.org/badge/147885122.svg
4141
:alt: Zenodo DOI
42-
:scale: 100%
4342
:target: https://zenodo.org/badge/latestdoi/147885122
4443

4544
.. |PRwelcome| image:: https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square
@@ -65,8 +64,6 @@ of pull requests are very welcome.
6564
.. _`semantic versioning`: https://semver.org
6665
.. _`Issue Tracker`:
6766
https://github.com/mdanalysis/MDAnalysisData/issues
68-
.. _`Contributions of data sets`:
69-
https://github.com/mdanalysis/MDAnalysisData/wiki/contributing
7067

7168

7269
.. toctree::
@@ -76,6 +73,7 @@ of pull requests are very welcome.
7673

7774
install
7875
usage
76+
contributing
7977
helpers
8078
credits
8179
