|
| 1 | +.. -*- coding: utf-8 -*- |
| 2 | +.. _contributing: |
| 3 | + |
| 4 | +=========================== |
| 5 | + Contributing new datasets |
| 6 | +=========================== |
| 7 | + |
| 8 | +New datasets are very welcome and everybody is encouraged to make their |
| 9 | +datasets accessible via :mod:`MDAnalysisData`, regardless of the simulation |
| 10 | +package or analysis code that they use. Users are encouraged to cite the |
| 11 | +authors of the datasets. |
| 12 | + |
| 13 | +:mod:`MDAnalysisData` does *not* store files and trajectories. Instead, it |
| 14 | +provides accessor code to seamlessly download (and cache) files from archives. |
| 15 | + |
| 16 | + |
| 17 | +Outline |
| 18 | +======= |
| 19 | + |
| 20 | +When you contribute data, you have to do two things: |
| 21 | + |
| 22 | +1. **deposit data in an archive** under an `Open Data`_ compatible license |
| 23 | + (`CC0`_ or `CC-BY`_ preferred) |
| 24 | +2. **write accessor code** in :mod:`MDAnalysisData` |
| 25 | + |
| 26 | + The accessor code needs the stable archive URL(s) for your files and SHA256 |
| 27 | + checksums to verify the integrity of any downloaded files. You will also add |
| 28 | + a description of your dataset. |
| 29 | + |
| 30 | + |
| 31 | +.. note:: |
| 32 | + |
| 33 | + We currently have code to work with the `figshare`_ archive so choosing |
| 34 | + *figshare* will be easiest. But it should be straightforward to add code to |
| 35 | + work with other archive-grade repositories such as `zenodo`_ or |
| 36 | + `DataDryad`_. Some universities also provide digital repositories that are |
| 37 | + suitable. Open an issue in the `Issue Tracker`_ for supporting other |
| 38 | + archives. |
| 39 | + |
| 40 | + |
| 41 | +Step-by-step instructions |
| 42 | +========================= |
| 43 | + |
| 44 | +To add a new dataset, deposit your data in a repository. Then open a *pull |
| 45 | +request* for the https://github.com/MDAnalysis/MDAnalysisData |
| 46 | +repository. Follow these steps: |
| 47 | + |
| 48 | +STEP 1: Archival deposition |
| 49 | +--------------------------- |
| 50 | + |
| 51 | +Deposit *all* required files in an archive-grade repository such as |
| 52 | +`figshare`_. |
| 53 | + |
| 54 | +.. Note:: |
| 55 | + |
| 56 | + The site must *provide stable download links* and *may not change the |
| 57 | + content during download* because we store a SHA256 :ref:`checksum<checksum>` |
| 58 | + to check file integrity. |
| 59 | + |
| 60 | +Make sure to **choose an** `Open Data`_ **compatible license** such as CC0_ or |
| 61 | +`CC-BY`_. |
| 62 | + |
| 63 | +Take note of the **direct download URL** for each of your files. It should be |
| 64 | +possible to obtain the file directly from a stable URL with :program:`curl` or |
| 65 | +:program:`wget`. As an example look at the dataset for |
| 66 | +:mod:`MDAnalysisData.adk_equilibrium` at DOI `10.6084/m9.figshare.5108170`_ (as |
| 67 | +shown in the :ref:`figure below<fig-figshare-adk>`). Especially note the |
| 68 | +*download* links of the DCD trajectory |
| 69 | +(https://ndownloader.figshare.com/files/8672074) and PSF topology files |
| 70 | +(https://ndownloader.figshare.com/files/8672230) as these links will be needed |
| 71 | +in the accessor code in :mod:`MDAnalysisData` in the next step. |
| 72 | + |
| 73 | +.. _fig-figshare-adk: |
| 74 | + |
| 75 | +.. figure:: images/figshare_adk_equilibrium.png |
| 76 | + |
| 77 | + The AdK Equilibrium dataset on figshare DOI `10.6084/m9.figshare.5108170`_, |
| 78 | + highlighting the deposited trajectory and topology files. The *download* |
| 79 | + URLs are visible when hovering over a file's image. |
| 80 | + |
| 81 | + |
| 82 | +.. _`10.6084/m9.figshare.5108170`: |
| 83 | + https://doi.org/10.6084/m9.figshare.5108170 |
| 84 | + |
| 85 | + |
| 86 | + |
| 87 | +STEP 2: Add code and docs to MDAnalysisData |
| 88 | +------------------------------------------- |
| 89 | + |
| 90 | + |
| 91 | +1. Add a Python module ``{MODULE_NAME}.py`` with the name of your dataset |
| 92 | + (where ``{MODULE_NAME}`` is just a placeholder). As an example, see |
| 93 | + `MDAnalysisData/adk_equilibrium.py`_, which becomes |
| 94 | + :mod:`MDAnalysisData.adk_equilibrium`. In many cases you can copy an |
| 95 | + existing module and adapt the following: |
| 96 | + |
| 97 | + - text: describe your dataset |
| 98 | + - :data:`NAME`: name of the dataset; it is used as a file name, so do not use spaces or other special characters |
| 99 | + - :data:`DESCRIPTION`: filename of the description file (which is written in |
| 100 | + reStructuredText, so it needs the suffix ``.rst``) |
| 101 | + - :data:`ARCHIVE`: dictionary containing |
| 102 | + :class:`~MDAnalysisData.base.RemoteFileMetadata` instances. Keys should |
| 103 | + describe the file type. Typically |
| 104 | + |
| 105 | + - *topology*: topology file (PSF, TPR, ...) |
| 106 | + - *trajectory*: trajectory coordinate file (DCD, XTC, ...) |
| 107 | + - *structure* (optional): system with single frame of coordinates |
| 108 | + (typically PDB, GRO, CRD, ...) |
| 109 | + |
| 110 | + - name of the :func:`fetch_{NAME}` function (where ``{NAME}`` is a suitable |
| 111 | + name to access your dataset) |
| 112 | + - docs of the :func:`fetch_{NAME}` function |
| 113 | + - calculate and store the reference :ref:`SHA256 checksum <checksum>` as |
| 114 | + described below |
| 115 | + |
| 116 | +2. Add a description file (example: |
| 117 | + `MDAnalysisData/descr/adk_equilibrium.rst`_); copy an existing file and |
| 118 | + adapt. **Make sure to add license information.** |
| 119 | +3. Import your :func:`fetch_{NAME}` function in |
| 120 | + `MDAnalysisData/datasets.py`_. :: |
| 121 | + |
| 122 | + from .{MODULE_NAME} import fetch_{NAME} |
| 123 | + |
| 124 | +4. Add documentation ``{NAME}.rst`` in reStructuredText format under `docs/`_ |
| 125 | + (take existing files as examples) and append ``{NAME}`` to the second |
| 126 | + ``toctree`` section of the `docs/index.rst`_ file. |
| 127 | + |
| 128 | + .. code-block:: reST |
| 129 | +
|
| 130 | + .. toctree:: |
| 131 | + :maxdepth: 1 |
| 132 | + :caption: Datasets |
| 133 | + :hidden: |
| 134 | + |
| 135 | + adk_equilibrium |
| 136 | + adk_transitions |
| 137 | + ... |
| 138 | + CG_fiber |
| 139 | + {NAME} |
| 140 | + |
| 141 | +If your dataset does not follow the same pattern as the example above (where |
| 142 | +each file is downloaded separately), then you have to write your own |
| 143 | +:func:`fetch_{NAME}` function. For example, you might download a tar file and then |
| 144 | +unpack the file yourself. Use scikit-learn's `sklearn/datasets`_ as examples. |
| 145 | +Make sure that your function sets appropriate attributes in the returned |
| 146 | +:class:`~MDAnalysisData.base.Bunch` of records, and fully document what is |
| 147 | +returned. |
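The module-level names listed in item 1 of STEP 2 can be sketched as follows. This is only an illustration, not a copy of an existing module: ``my_dataset``, the file names, and the all-zero checksum are hypothetical placeholders, the URLs are the AdK download links quoted in STEP 1 used as stand-ins, and the ``RemoteFileMetadata`` defined here is a local stand-in that assumes the real class carries the fields ``filename``, ``url``, and ``checksum``. In an actual contribution you would import the class from ``MDAnalysisData.base`` instead.

```python
from collections import namedtuple

# Stand-in for MDAnalysisData.base.RemoteFileMetadata. ASSUMPTION: the real
# class has the fields (filename, url, checksum); import it from
# MDAnalysisData.base in a real module instead of redefining it.
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

NAME = "my_dataset"             # hypothetical dataset name; no spaces
DESCRIPTION = "my_dataset.rst"  # hypothetical description file name

# Placeholder only: replace with the real 64-character SHA256 hex digest
# computed for each deposited file.
PLACEHOLDER_CHECKSUM = "0" * 64

# Keys describe the file type; values hold the download metadata.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="my_system.psf",  # hypothetical file name
        url="https://ndownloader.figshare.com/files/8672230",
        checksum=PLACEHOLDER_CHECKSUM,
    ),
    "trajectory": RemoteFileMetadata(
        filename="my_trajectory.dcd",  # hypothetical file name
        url="https://ndownloader.figshare.com/files/8672074",
        checksum=PLACEHOLDER_CHECKSUM,
    ),
}

# Quick sanity check of what a fetch function would iterate over.
for key, meta in ARCHIVE.items():
    print(key, meta.filename, meta.url)
```

Keeping one entry per file type makes it easy for a ``fetch_{NAME}`` function to loop over ``ARCHIVE`` and expose each downloaded file under the same key.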
| 148 | + |
| 149 | + |
| 150 | +.. _checksum: |
| 151 | + |
| 152 | +RemoteFileMetadata and SHA256 checksum |
| 153 | +====================================== |
| 154 | + |
| 155 | +The :class:`~MDAnalysisData.base.RemoteFileMetadata` is used by |
| 156 | +:func:`~MDAnalysisData.base._fetch_remote`, which checks file integrity by |
| 157 | +computing a SHA256 checksum over each downloaded file and comparing it with a |
| 158 | +stored reference checksum. **You must compute the reference checksum and store it in your** |
| 159 | +:class:`~MDAnalysisData.base.RemoteFileMetadata` data structure for each file. |
| 160 | + |
| 161 | +Typically you will have a local copy of the files during testing. You can |
| 162 | +compute the SHA256 for a file ``FILENAME`` with the following code:: |
| 163 | + |
| 164 | + import MDAnalysisData.base |
| 165 | + print(MDAnalysisData.base._sha256(FILENAME)) |
| 166 | + |
| 167 | +or from the command line: |
| 168 | + |
| 169 | +.. code-block:: bash |
| 170 | +
|
| 171 | + python -c 'import MDAnalysisData.base; print(MDAnalysisData.base._sha256("FILENAME"))' |
| 172 | +
|
| 173 | +where ``FILENAME`` is the file that is stored in the archive. |
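If :mod:`MDAnalysisData` is not installed where the files live, the same kind of checksum can be computed with the Python standard library alone. The following is a minimal sketch, not necessarily how ``_sha256`` is implemented; the helper name ``sha256_of_file`` and the temporary demo file are made up for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at `path`.

    The file is read in chunks so that large trajectories do not have to
    fit into memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file (stand-in for a real trajectory).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example trajectory bytes")
    demo_path = tmp.name

reference = sha256_of_file(demo_path)
print(reference)  # 64-character hex string to store as the reference checksum
os.remove(demo_path)
```

The printed hex string is what would go into the checksum entry of the corresponding ``RemoteFileMetadata`` record.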
| 174 | + |
| 175 | + |
| 176 | +.. references |
| 177 | +
|
| 178 | +.. _`Open Data`: https://opendatacommons.org/ |
| 179 | +.. _CC0: https://creativecommons.org/share-your-work/public-domain/cc0 |
| 180 | +.. _CC-BY: https://creativecommons.org/licenses/by/4.0/ |
| 181 | +.. _figshare: https://figshare.com/ |
| 182 | +.. _zenodo: https://zenodo.org/ |
| 183 | +.. _DataDryad: https://www.datadryad.org/ |
| 184 | +.. _`Issue Tracker`: https://github.com/MDAnalysis/MDAnalysisData/issues |
| 185 | +.. _`MDAnalysisData/adk_equilibrium.py`: |
| 186 | + https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/adk_equilibrium.py |
| 187 | +.. _`MDAnalysisData/descr/adk_equilibrium.rst`: |
| 188 | + https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/descr/adk_equilibrium.rst |
| 189 | +.. _`MDAnalysisData/datasets.py`: |
| 190 | + https://github.com/MDAnalysis/MDAnalysisData/blob/master/MDAnalysisData/datasets.py |
| 191 | +.. _`docs/`: |
| 192 | + https://github.com/MDAnalysis/MDAnalysisData/blob/master/docs/ |
| 193 | +.. _`docs/index.rst`: |
| 194 | + https://github.com/MDAnalysis/MDAnalysisData/blob/master/docs/index.rst |
| 195 | +.. _`sklearn/datasets`: |
| 196 | + https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/datasets |