Commit 00e7c59
committed: update readme
1 parent a5a73d9 commit 00e7c59

2 files changed: 165 additions & 271 deletions

README.md: 165 additions & 74 deletions
![# Sonosco](./docs/imgs/sonosco_3.jpg)
<br>
<br>
<br>
<br>

Sonosco (from Lat. sonus - sound and nōscō - I know, recognize)
is a library for training and deploying deep speech recognition models.

The goal of this project is to enable fast, repeatable and structured training of deep
automatic speech recognition (ASR) models, as well as to provide a transcription server
(REST API & frontend) for trying out the trained models. <br>
Additionally, we provide interfaces to ROS in order to use Sonosco with
the anthropomimetic robot [Roboy](https://roboy.org/).
<br>
<br>
<br>

___
### Installation

#### Via pip
The easiest way to use Sonosco's functionality is via pip:
```
pip install sonosco
```
**Note**: Sonosco requires Python 3.7 or higher.

For reliability, we recommend using an environment virtualization tool, like virtualenv or conda.

<br>
<br>
#### For developers or trying out the transcription server

Clone the repository and install its dependencies:
```
# Create a virtual python environment to not pollute the global setup
conda create -n 'sonosco' python=3.7

# Activate the virtual environment
conda activate sonosco

# Clone the repo and step into it
git clone https://github.com/Roboy/sonosco.git
cd sonosco

# Install the requirements
pip install -r requirements.txt

# Link your local sonosco clone into your virtual environment
pip install .
```
Now you can check out some of the [Getting Started]() tutorials to train a model or use
the transcription server.
<br>
<br>
<br>
____________
### High Level Design

![# High-Level-Design](./docs/imgs/high-level-design.svg)

The project is split into four parts that interact with each other:

For data processing, scripts are provided to download and preprocess
some publicly available speech recognition datasets. Additionally,
we provide scripts and functions to create manifest files
(i.e. catalog files) for your own data and to merge existing manifest files
into one.

These manifest files can then be used to easily train and
evaluate an ASR model. We provide several ASR model architectures, such as LAS,
TDS and DeepSpeech2, but custom PyTorch models can be designed and trained as well.

The trained model can then be used in a transcription server, which consists
of a REST API as well as a simple Vue.js frontend to transcribe voice recorded
by a microphone and to compare the transcription results to those of other models (which can
be downloaded from our [Github](https://github.com/Roboy/sonosco) repository).

Further, we provide example code showing how to use different ASR models with ROS,
and especially with the Roboy ROS interfaces (i.e. topics & messages).

<br>
<br>

______
### Data (-processing)

##### Downloading publicly available datasets
We provide scripts to download and process the following publicly available datasets:
* [An4](http://www.speech.cs.cmu.edu/databases/an4/) - alphanumeric database
* [Librispeech](http://www.openslr.org/12) - read English books
* [TED-LIUM 3](https://lium.univ-lemans.fr/en/ted-lium3/) (ted3) - TED talks
* [Voxforge](http://www.voxforge.org/home/downloads)
* Common Voice (old version)

Simply run the respective script in `sonosco > datasets > download_datasets` with the
output path flag and it will download and process the dataset. Further, it will create
a manifest file for the dataset.

For example:

```
python an4.py --target-dir temp/data/an4
```
<br>
<br>

##### Creating a manifest from your own data

If you want to create a manifest from your own data, organize your files as follows:
```
data_directory
└───txt
│   │   transcription01.txt
│   │   transcription02.txt
│
└───wav
    │   audio01.wav
    │   audio02.wav
```
To create a manifest, run the `create_manifest.py` script with the data directory and an output file
to automatically create a manifest file for your data.

For example:
```
python create_manifest.py --data_path path/to/data_directory --output-file temp/data/manifest.csv
```
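
The resulting manifest is a plain CSV catalog. Judging from the format documented in the previous revision of this README (assuming it is unchanged), each line pairs the path of an audio file with the path of its transcript:

```
path/to/data_directory/wav/audio01.wav,path/to/data_directory/txt/transcription01.txt
path/to/data_directory/wav/audio02.wav,path/to/data_directory/txt/transcription02.txt
```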

<br>
<br>

##### Merging manifest files

In order to merge multiple manifests into one, just specify a folder that contains all the manifest
files to be merged and run `merge_manifest.py`.
This will look for all .csv files and merge their content into the specified output file.

For example:
```
python merge_manifest.py --merge-dir path/to/manifests_dir --output-path temp/manifests/merged_manifest.csv
```

<br>
<br>

___
### Model Training

One goal of this framework is to keep training as easy as possible and to enable
keeping track of already conducted experiments.
<br>
<br>

#### Analysis Object Model

For model training, there are multiple objects that interact with each other.

![# Analysis Object Model](./docs/imgs/aom.svg)

For model training, one can define different metrics that get evaluated during the training
process. These metrics are evaluated at specified steps during an epoch and during
validation.<br>
Sonosco already provides different metrics, such as [Word Error Rate (WER)]() and
[Character Error Rate (CER)](), but additional metrics can be created following the same scheme.
See [Metrics]().
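
As an illustration, here is a minimal sketch of a WER-style metric written as a plain function; the exact signature Sonosco expects from a metric is an assumption here and is documented in the Metrics guide:

```
# Minimal sketch of a custom WER-style metric. The (ground_truth, transcription)
# signature is an assumption; consult the Metrics guide for the real interface.
def word_error_rate(ground_truth: str, transcription: str) -> float:
    """Word-level Levenshtein distance, normalized by the reference length."""
    ref, hyp = ground_truth.split(), transcription.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i][j - 1] + 1, d[i - 1][j] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```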

Additionally, callbacks can be defined. A callback is arbitrary code that can be executed during
training. Sonosco provides several callbacks, for example [Learning Rate Reduction](),
[ModelSerializationCallback](), [TensorboardCallback](), ... <br>
Custom callbacks can be defined following the examples. See [Callbacks]().
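
As a sketch, a custom callback could look like the following; the hook name and its arguments are assumptions for illustration, the real interface is described in the Callbacks guide:

```
# Hypothetical custom callback; the hook name and arguments are assumptions.
class LossLoggingCallback:
    """Prints the running loss at the end of selected epochs."""

    def __init__(self, log_every_n_epochs: int = 1):
        self.log_every_n_epochs = log_every_n_epochs

    def on_epoch_end(self, epoch: int, running_loss: float) -> None:
        if epoch % self.log_every_n_epochs == 0:
            print(f"epoch {epoch}: loss={running_loss:.4f}")
```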

Most importantly, a model needs to be defined. The model is basically any torch module. For
(de-)serialization, this model needs to conform to the [Serialization Guide]().<br>
Sonosco already provides model architectures that can simply be imported, such as
[Listen Attend Spell](), [Time-depth Separable Convolutions]() and [DeepSpeech2]().
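
Since the model is just a torch module, a minimal custom model can be as small as the sketch below; the layer sizes are arbitrary placeholders, and the constructor conventions required for (de-)serialization are described in the Serialization Guide:

```
import torch.nn as nn

# Minimal sketch: any torch module can serve as a model.
# Layer sizes are arbitrary placeholders.
class TinyASRModel(nn.Module):
    def __init__(self, n_features: int = 161, n_classes: int = 29):
        super().__init__()
        self.rnn = nn.GRU(n_features, 256, batch_first=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, spectrograms):
        # spectrograms: (batch, time, features) -> per-timestep class scores
        hidden, _ = self.rnn(spectrograms)
        return self.classifier(hidden)
```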

We created a specific AudioDataset class that is based on the PyTorch Dataset class.
This AudioDataset requires an AudioDataProcessor in order to process the specified manifest file.
Further, we created a special AudioDataLoader based on PyTorch's DataLoader class, which
takes the AudioDataset and provides the data in batches to the model training.

Metrics, callbacks, the model and the AudioDataLoader are then provided to the ModelTrainer.
This ModelTrainer takes care of the training process. See [Getting Started]().
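
Put together, the wiring could look roughly like this; the class names come from this README, but every constructor argument and method name below is an assumption, so consult the Getting Started tutorials for the real signatures:

```
# Hypothetical wiring of the objects described above (signatures assumed).
processor = AudioDataProcessor(sample_rate=16000)
dataset = AudioDataset(processor, manifest_filepath="temp/data/manifest.csv")
loader = AudioDataLoader(dataset, batch_size=32)

trainer = ModelTrainer(
    model=TinyASRModel(),          # the sketch module from above
    data_loader=loader,
    metrics=[word_error_rate],     # the sketch metric from above
    callbacks=[LossLoggingCallback()],
)
trainer.start_training()           # assumed method name
```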

The ModelTrainer can then be registered to the Experiment, which takes care of provenance.
That is, when the training is started, all your code is time-stamped and saved in a separate directory,
so you can always repeat the same experiment. Additionally, the serialized model and ModelTrainer,
the logs and the tensorboard logs are saved in this folder.

Further, a Serializer needs to be provided to the Experiment. This object can serialize any
arbitrary class with its parameters, which can then be deserialized using the Deserializer.<br>
When the `Experiment.stop()` method is called, the model and the ModelTrainer are serialized,
so that you can simply continue the training with all current parameters (such as epoch steps)
when deserializing the ModelTrainer and continuing the training.
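
To round off the training walkthrough, a run under experiment tracking could look roughly as follows; `Experiment.stop()` is taken from the paragraph above, while the constructor and registration calls are assumptions:

```
# Hypothetical experiment workflow; only Experiment.stop() is taken from the
# description above, all other names are assumptions.
experiment = Experiment("asr_run_01", serializer=Serializer())
experiment.register(trainer)   # assumed registration API
trainer.start_training()
experiment.stop()              # serializes the model and the ModelTrainer
```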
