Speech recognition for general tasks is widely available in many languages from providers such as Google and Microsoft. However, current technology cannot yet deliver a general-purpose speech recognizer of good quality. A good-quality system therefore requires training on case-specific datasets for the task at hand, which is not possible with, e.g., the Google Speech API. Such customization also makes it easier to take additional aspects of the audio into account in subsequent models. Confidentiality is another concern with these cloud services: in some use cases, the data must not leave the organization in question. The cost of continued use of cloud services can also be considerable.
The aim of this project is to help Silo.ai create an in-house solution for speech recognition. A general deep-learning-based speech recognizer is trained on open data and other available sources. This general model is then used as the basis for better case-specific models, which are trained on client data. The resulting models are portable and can be set up either in cloud environments or on local servers. The model can easily be combined with, or serve as input to, additional ML models such as sentence classifiers.
Robustness to noise and other interference in the environment is a crucial feature of a speech recognition system. The system should also be robust to different speakers, especially in public environments, rather than being adapted specifically to each user. The purpose of this research project is to investigate different approaches to making speech recognition systems robust to noisy environments and to different speakers. The project will study speech enhancement techniques based on next-generation, data-driven approaches.