|
1 | | -# Roboy Sonosco |
2 | | -Roboy Sonosco (from Lat. sonus - sound and nōscō - I know, recognize) - a library for Speech Recognition based on Deep Learning models |
| 1 | + |
| 2 | +<br> |
| 3 | +<br> |
| 4 | +<br> |
| 5 | +<br> |
3 | 6 |
|
4 | | -## Installation |
| 7 | +Sonosco (from Lat. sonus - sound and nōscō - I know, recognize) |
| 8 | +is a library for training and deploying deep speech recognition models. |
5 | 9 |
|
6 | | -The supported OS is Ubuntu 18.04 LTS (however, it should work fine on other distributions). |
7 | | -Supported Python version is 3.6+. |
8 | | -Supported CUDA version is 10.0. |
9 | | -Supported PyTorch version is 1.0. |
| 10 | +The goal of this project is to enable fast, repeatable and structured training of deep |
| 11 | +automatic speech recognition (ASR) models as well as providing a transcription server (REST API & frontend) to |
| 12 | +try out the trained models for transcription. <br> |
| 13 | +Additionally, we provide interfaces to ROS in order to use it with |
| 14 | +the anthropomimetic robot [Roboy](https://roboy.org/). |
| 15 | +<br> |
| 16 | +<br> |
| 17 | +<br> |
10 | 18 |
|
11 | | ---- |
| 19 | +___ |
| 20 | +### Installation |
12 | 21 |
|
13 | | -Install CUDA 10.0 from [NVIDIA website](https://developer.nvidia.com/cuda-10.0-download-archive). Make sure that your local gcc, g++, cmake versions are not older than the ones used to compile your OS kernel. |
14 | | - |
15 | | -You will need to download the latest [cuDNN](https://developer.nvidia.com/rdp/cudnn-archive) for CUDA 10.0. |
16 | | -Unzip it: |
17 | | -``` |
18 | | -tar -xzvf cudnn-9.0-linux-x64-v7.tgz |
19 | | -``` |
20 | | -Run |
| 22 | +#### Via pip |
| 23 | +The easiest way to use Sonosco's functionality is via pip: |
21 | 24 | ``` |
22 | | -sudo cp cuda/include/cudnn.h /usr/local/cuda/include |
23 | | -sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64 |
24 | | -sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn* |
| 25 | +pip install sonosco |
25 | 26 | ``` |
26 | | ---- |
| 27 | +**Note**: Sonosco requires Python 3.7 or higher. |
27 | 28 |
|
28 | | -**All of the following steps you may perform inside [Anaconda](https://www.anaconda.com/) or [virtualenv](https://virtualenv.pypa.io/en/latest/)** |
| 29 | +For reliability, we recommend using an environment virtualization tool, like virtualenv or conda. |
29 | 30 |
|
30 | | -Install [PyTorch](https://pytorch.org/get-started/locally/). For your particular configuration, you may want to build it from the [sources](https://github.com/pytorch/pytorch). |
| 31 | +<br> |
| 32 | +<br> |
| 33 | +#### For developers or for trying out the transcription server
31 | 34 |
|
32 | | -Install SeanNaren's fork for Warp-CTC bindings. **Deprecated**: will be updated to use [built-in](https://pytorch.org/docs/stable/nn.html#torch.nn.CTCLoss) functions. |
33 | | -``` |
34 | | -git clone https://github.com/SeanNaren/warp-ctc.git |
35 | | -cd warp-ctc; mkdir build; cd build; cmake ..; make |
36 | | -export CUDA_HOME="/usr/local/cuda" |
37 | | -cd ../pytorch_binding && python setup.py install |
| 35 | +Clone the repository and install dependencies: |
38 | 36 | ``` |
| 37 | +# Create a virtual python environment to not pollute the global setup |
| 38 | +conda create -n 'sonosco' python=3.7 |
39 | 39 |
|
40 | | -Install pytorch audio: |
41 | | -``` |
42 | | -sudo apt-get install sox libsox-dev libsox-fmt-all |
43 | | -git clone https://github.com/pytorch/audio.git |
44 | | -cd audio && python setup.py install |
| 40 | +# activate the virtual environment |
| 41 | +conda activate sonosco |
| 42 | +
|
| 43 | +# Clone the repo |
| 44 | +git clone https://github.com/Roboy/sonosco.git |
| 45 | +
|
| 46 | +# Install normal requirements (from inside the cloned repo)
| | +cd sonosco
| 47 | +pip install -r requirements.txt
| 48 | +
|
| 49 | +# Link your local sonosco clone into your virtual environment |
| 50 | +pip install . |
45 | 51 | ``` |
| 52 | +Now you can check out some of the [Getting Started]() tutorials to train a model or use
| 53 | +the transcription server. |
| 54 | +<br> |
| 55 | +<br> |
| 56 | +<br> |
| 57 | +____________ |
| 58 | +### High Level Design |
| 59 | + |
| 60 | + |
| 61 | + |
| 62 | + |
| 63 | +The project is split into four parts that build on each other:
| 64 | + |
| 65 | +For data processing, scripts are provided to download and preprocess
| 66 | +some publicly available datasets for speech recognition. Additionally, |
| 67 | +we provide scripts and functions to create manifest files |
| 68 | +(i.e. catalog files) for your own data and merge existing manifest files |
| 69 | +into one. |
| 70 | + |
| 71 | +This data, or rather the manifest files, can then be used to easily train and
| 72 | +evaluate an ASR model. We provide several ASR model architectures, such as LAS,
| 73 | +TDS and DeepSpeech2, but you can also design and train your own custom PyTorch models.
| 74 | + |
| 75 | +The trained model can then be used in a transcription server, which consists
| 76 | +of a REST API as well as a simple Vue.js frontend to transcribe voice recorded
| 77 | +by a microphone and compare the transcription results to those of other models (which can
| 78 | +be downloaded from our [Github](https://github.com/Roboy/sonosco) repository).
| 79 | + |
| 80 | +Further, we provide example code showing how to use different ASR models with ROS,
| 81 | +in particular with the Roboy ROS interfaces (i.e. topics & messages).
| 82 | + |
| 83 | +<br> |
| 84 | +<br> |
| 85 | + |
| 86 | + |
| 87 | +______ |
| 88 | +### Data (pre-)processing
| 89 | + |
| 90 | +##### Downloading publicly available datasets |
| 91 | +We provide scripts to download and process the following publicly available datasets: |
| 92 | +* [An4](http://www.speech.cs.cmu.edu/databases/an4/) - Alphanumeric database |
| 93 | +* [Librispeech](http://www.openslr.org/12) - read English audiobooks
| 94 | +* [TED-LIUM 3](https://lium.univ-lemans.fr/en/ted-lium3/) (ted3) - TED talks |
| 95 | +* [Voxforge](http://www.voxforge.org/home/downloads) |
| 96 | +* Common Voice (old version)
| 97 | + |
| 98 | +Simply run the respective script in `sonosco/datasets/download_datasets` with the
| 99 | +output path flag and it will download and process the dataset. Further, it will create
| 100 | +a manifest file for the dataset.
| 101 | + |
| 102 | +For example:
46 | 103 |
|
47 | | -If you want decoding to support beam search with an optional language model, install [ctcdecode](https://github.com/parlance/ctcdecode): |
48 | 104 | ``` |
49 | | -git clone --recursive https://github.com/parlance/ctcdecode.git |
50 | | -cd ctcdecode && pip install . |
| 105 | +python an4.py --target-dir temp/data/an4 |
51 | 106 | ``` |
| 107 | +<br> |
| 108 | +<br> |
52 | 109 |
|
53 | | -Clone this repo and run this within the repo: |
| 110 | +##### Creating a manifest from your own data |
| 111 | + |
| 112 | +If you want to create a manifest from your own data, order your files as follows: |
54 | 113 | ``` |
55 | | -pip install -r requirements.txt |
| 114 | +data_directory |
| 115 | +└───txt |
| 116 | +│ │ transcription01.txt |
| 117 | +│ │ transcription02.txt |
| 118 | +│ |
| 119 | +└───wav |
| 120 | + │ audio01.wav |
| 121 | + │ audio02.wav |
56 | 122 | ``` |
| 123 | +Then run the `create_manifest.py` script with the data directory and an output file path
| 124 | +to automatically create a manifest file for your data.
57 | 125 |
|
58 | | -### Mixed Precision |
59 | | -If you want to use mixed precision training, you have to install [NVIDIA Apex](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/): |
| 126 | +For example: |
60 | 127 | ``` |
61 | | -git clone --recursive https://github.com/NVIDIA/apex.git |
62 | | -cd apex && pip install . |
| 128 | +python create_manifest.py --data_path path/to/data_directory --output-file temp/data/manifest.csv |
63 | 129 | ``` |
64 | 130 |
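As an illustration only: assuming the manifest keeps the simple CSV format from earlier versions of this project (one `audio_path,transcript_path` pair per line) and that matching audio and transcription files share a base name, a hand-rolled version of this step could look roughly like the sketch below. It is not the actual `create_manifest.py` implementation.

```
# Sketch only (not Sonosco's create_manifest.py): pair the wav/ and txt/ files from the
# directory layout above and write one "audio_path,transcript_path" row per sample.
# The pairing rule (matching base names) is an assumption; adapt it to your naming scheme.
import csv
import os

def build_manifest(data_dir: str, output_file: str) -> None:
    wav_dir, txt_dir = os.path.join(data_dir, "wav"), os.path.join(data_dir, "txt")
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        for wav_name in sorted(os.listdir(wav_dir)):
            if not wav_name.endswith(".wav"):
                continue
            txt_path = os.path.join(txt_dir, os.path.splitext(wav_name)[0] + ".txt")
            if os.path.exists(txt_path):
                writer.writerow([os.path.join(wav_dir, wav_name), txt_path])

if __name__ == "__main__":
    build_manifest("path/to/data_directory", "temp/data/manifest.csv")
```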
|
65 | | -## Usage |
| 131 | +<br> |
| 132 | +<br> |
66 | 133 |
|
67 | | -### Dataset |
| 134 | +##### Merging manifest files |
68 | 135 |
|
69 | | -To create a dataset you must create a CSV manifest file containing the locations of the training data. This has to be in the format of: |
| 136 | +In order to merge multiple manifests into one, just specify a folder that contains all manifest |
| 137 | +files to be merged and run the `merge_manifest.py` script.
| 138 | +This will look for all .csv files and merge their contents into the specified output file.
| 139 | + |
| 140 | +For example: |
70 | 141 | ``` |
71 | | -/path/to/audio.wav,/path/to/text.txt |
72 | | -/path/to/audio2.wav,/path/to/text2.txt |
73 | | -... |
| 142 | +python merge_manifest.py --merge-dir path/to/manifests_dir --output-path temp/manifests/merged_manifest.csv |
74 | 143 | ``` |
75 | | -There is an example in examples directory. |
76 | 144 |
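Conceptually, merging just concatenates the rows of every `.csv` manifest found in the given folder. The following is a rough sketch of that idea, not the actual `merge_manifest.py` implementation:

```
# Sketch only (not the actual merge_manifest.py): collect every .csv manifest in a
# directory and append their rows to a single merged manifest file.
import glob
import os

def merge_manifests(merge_dir: str, output_path: str) -> None:
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w") as merged:
        for manifest in sorted(glob.glob(os.path.join(merge_dir, "*.csv"))):
            with open(manifest) as f:
                for line in f:
                    if line.strip():  # skip empty lines
                        merged.write(line.rstrip("\n") + "\n")

if __name__ == "__main__":
    merge_manifests("path/to/manifests_dir", "temp/manifests/merged_manifest.csv")
```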
|
77 | | -### Training, Testing and Inference |
| 145 | +<br> |
| 146 | +<br> |
78 | 147 |
|
79 | | -Fundamentally, you can run the scripts the same way: |
80 | | -``` |
81 | | -python3 train.py --config /path/to/config/file.yaml |
82 | | -python3 test.py --config /path/to/config/file.yaml |
83 | | -python3 infer.py --config /path/to/config/file.yaml |
84 | | -``` |
85 | | -The scripts are initialised via configuration files. |
86 | 148 |
|
87 | | -#### Configuration |
| 149 | +___ |
| 150 | +### Model Training |
88 | 151 |
|
89 | | -Configuration file contains arguments for ModelWrapper initialisation as well as extra parameters. Like this: |
90 | | -``` |
91 | | -train: |
92 | | - ... |
93 | | - log-dir: 'logs' # Location for log files |
94 | | - def-dir: 'examples/checkpoints/', # Default location to save/load models |
95 | | - model-name: 'asr_final.pth' # File name to save the best model |
96 | | - sample-rate: 16000 # Sample rate |
97 | | - window: 'hamming' # Window type for spectrogram generation |
98 | | - batch-size: 32 # Batch size for training |
99 | | - checkpoint: True # Enables checkpoint saving of model |
100 | | - ... |
101 | | -``` |
102 | | -More configuration examples with descriptions you may find in the config directory. |
| 152 | +One goal of this framework is to make training as easy as possible and to enable
| 153 | +keeping track of already conducted experiments.
| 154 | +<br> |
| 155 | +<br> |
| 156 | + |
| 157 | +#### Analysis Object Model |
| 158 | + |
| 159 | +For model training, there are multiple objects that interact with each other. |
| 160 | + |
| 161 | + |
| 162 | + |
| 163 | +For model training, one can define different metrics that are evaluated during the training
| 164 | +process. These metrics are evaluated at specified steps during an epoch and during
| 165 | +validation.<br>
| 166 | +Sonosco already provides different metrics, such as [Word Error Rate (WER)]() or
| 167 | +[Character Error Rate (CER)](). Additional metrics can be created following the same scheme.
| 168 | +See [Metrics]().
| 169 | + |
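For intuition, the word error rate is the word-level edit distance between hypothesis and reference, normalized by the reference length (CER is the same idea computed on characters). A minimal, framework-independent sketch of the metric itself, not of Sonosco's metric interface:

```
# Minimal word error rate (WER) via Levenshtein distance; illustration of the metric
# itself, not Sonosco's Metric API.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```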
| 170 | +Additionally, callbacks can be defined. A callback is arbitrary code that is executed at defined
| 171 | +points during training. Sonosco provides several callbacks, such as [Learning Rate Reduction](),
| 172 | +[ModelSerializationCallback](), [TensorboardCallback](), ... <br>
| 173 | +Custom callbacks can be defined following these examples. See [Callbacks]().
| 174 | + |
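Conceptually, a callback is just an object that the trainer invokes at certain points. The hook name and arguments below are assumptions for illustration and do not match Sonosco's actual callback interface; see [Callbacks]() for the real one.

```
# Illustration only: the __call__ signature below is an assumption, not Sonosco's real
# callback interface. It merely shows the pattern of "arbitrary code executed at
# defined points during training".
class LossLoggingCallback:
    def __init__(self, log_every: int = 100):
        self.log_every = log_every

    def __call__(self, step: int, loss: float) -> None:
        # invoked by a (hypothetical) trainer after every batch
        if step % self.log_every == 0:
            print(f"step {step}: loss = {loss:.4f}")
```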
| 175 | +Most importantly, a model needs to be defined. The model is basically any torch module. For |
| 176 | +(de-) serialization, this model needs to conform to the [Serialization Guide]().<br> |
| 177 | +Sonosco already provides model architectures that can simply be imported, such as
| 178 | +[Listen Attend Spell](), [Time-depth Separable Convolutions]() and [DeepSpeech2](). |
| 179 | + |
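Since a model is just a torch module, any `torch.nn.Module` is a valid starting point. A toy example (not one of the provided architectures) mapping spectrogram frames to per-frame character logits:

```
import torch
import torch.nn as nn

# Toy acoustic model, only to show that "a model" is a plain torch.nn.Module.
# It is NOT one of the provided architectures (LAS, TDS, DeepSpeech2).
class TinySpeechModel(nn.Module):
    def __init__(self, n_features: int = 161, n_classes: int = 29, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, spectrograms: torch.Tensor) -> torch.Tensor:
        # spectrograms: (batch, time, features) -> (batch, time, n_classes)
        out, _ = self.rnn(spectrograms)
        return self.fc(out)

model = TinySpeechModel()
logits = model(torch.randn(4, 100, 161))  # batch of 4 utterances, 100 frames each
print(logits.shape)  # torch.Size([4, 100, 29])
```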
| 180 | +We created a specific AudioDataset class that is based on the PyTorch Dataset class.
| 181 | +This AudioDataset requires an AudioDataProcessor in order to process the specified manifest file.
| 182 | +Further, we created a special AudioDataLoader based on PyTorch's DataLoader class, which
| 183 | +takes the AudioDataset and provides the data in batches for model training.
| 184 | + |
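As a rough mental model of these classes (names and details below are illustrative, not the real Sonosco implementations, which also handle feature extraction, padding and augmentation), such a dataset reads the manifest and returns one sample per row:

```
import csv

import torchaudio
from torch.utils.data import Dataset, DataLoader

# Rough mental model only; Sonosco's AudioDataset/AudioDataProcessor/AudioDataLoader
# do more than this sketch.
class ManifestDataset(Dataset):
    def __init__(self, manifest_path: str):
        with open(manifest_path) as f:
            # each row: audio_path,transcript_path (assumed manifest format)
            self.samples = [row for row in csv.reader(f) if row]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        audio_path, transcript_path = self.samples[idx][:2]
        waveform, sample_rate = torchaudio.load(audio_path)
        with open(transcript_path) as f:
            transcript = f.read().strip()
        return waveform, sample_rate, transcript

# A real loader needs a collate_fn that pads variable-length audio into batches.
dataset = ManifestDataset("temp/data/manifest.csv")
loader = DataLoader(dataset, batch_size=1, shuffle=True)
```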
| 185 | +Metrics, Callbacks, the Model and the AudioDataLoader are then provided to the ModelTrainer. |
| 186 | +This ModelTrainer takes care of the training process. See [Getting Started]().
103 | 187 |
|
104 | | -## Acknowledgements |
| 188 | +The ModelTrainer can then be registered to the Experiment, which takes care of provenance.
| 189 | +I.e. when starting the training, all your code is timestamped and saved in a separate directory,
| 190 | +so you can always repeat the same experiment. Additionally, the serialized model and ModelTrainer,
| 191 | +logs and tensorboard logs are saved in this folder.
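The provenance idea can be pictured with the sketch below (illustrative only, not Sonosco's Experiment class): every run gets its own timestamped directory holding a snapshot of the code, the logs and the serialized model.

```
# Illustration of the provenance idea only, not Sonosco's Experiment implementation.
import os
import shutil
from datetime import datetime

import torch

def snapshot_run(code_dir: str, model: torch.nn.Module, base_dir: str = "experiments") -> str:
    run_dir = os.path.join(base_dir, datetime.now().strftime("%Y%m%d_%H%M%S"))
    os.makedirs(run_dir, exist_ok=True)
    # copy the training code so the exact version can be re-run later
    shutil.copytree(code_dir, os.path.join(run_dir, "code"))
    # save the model weights; Sonosco's Serializer additionally stores the ModelTrainer state
    torch.save(model.state_dict(), os.path.join(run_dir, "model.pth"))
    return run_dir
```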
105 | 192 |
|
106 | | -This project is partially based on SeanNaren's [deepspeech.pytorch](https://github.com/SeanNaren/deepspeech.pytorch) repository. |
| 193 | +Further, a Serializer needs to be provided to the Experiment. This object can serialize any
| 194 | +arbitrary class with its parameters, which can then be deserialized using the Deserializer.<br>
| 195 | +When the ```Experiment.stop()``` method is called, the model and the ModelTrainer get serialized,
| 196 | +so that you can simply continue the training, with all current parameters (such as epoch steps, ...),
| 197 | +by deserializing the ModelTrainer and resuming from there.