Phonetically Rich Urdu Speech Corpus

A step towards Large Vocabulary Urdu Speech Recognition

The Urdu Phonetically Rich Speech Corpus contains 70 minutes of transcribed read speech: 708 greedily constructed sentences that represent all phonemic and triphonemic combinations in Urdu (based on an 18 million word corpus of Urdu news articles). It comprises 10,101 tokens and 5,656 unique words. In addition to providing phonetic cover for Urdu, the corpus is phonemically balanced. It also provides triphonemic cover, although it is not completely balanced for triphonemes. It contains 60 unique phones and 42,289 phone occurrences. All sentences in the corpus were created manually by trained linguists, who followed a greedy approach to accommodate the target words (which were selected using a set cover algorithm) while adding as few extra words as possible. As a result, although the sentences are grammatically correct, the choice of words is unusual in some instances.
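The word selection step mentioned above relies on a set cover algorithm. The following is a minimal sketch of one common greedy variant, not the procedure actually used to build this corpus. The function names, the toy lexicon, and its Latin-letter phone symbols are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of consecutive phone triples found in one word."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_cover(lexicon: Dict[str, List[str]]) -> List[str]:
    """Greedy set cover: repeatedly pick the word that adds the most
    triphones not yet covered, until every triphone is covered."""
    units = {word: triphones(phones) for word, phones in lexicon.items()}
    uncovered: Set[Triphone] = set()
    for covered_by_word in units.values():
        uncovered |= covered_by_word

    chosen: List[str] = []
    while uncovered:
        # Iterate in sorted order so ties are broken alphabetically.
        best = max(sorted(units), key=lambda w: len(units[w] & uncovered))
        chosen.append(best)
        uncovered -= units[best]
    return chosen


# Hypothetical toy lexicon (not taken from the corpus).
toy_lexicon = {
    "kitab": ["k", "i", "t", "a", "b"],
    "kita": ["k", "i", "t", "a"],
    "tab": ["t", "a", "b"],
    "bat": ["b", "a", "t"],
}

selected = greedy_cover(toy_lexicon)
print(selected)  # ['kitab', 'bat']
```

In this toy case two of the four words already cover every triphone. In the corpus itself, the selected words were then placed into sentences by hand, which a sketch like this does not attempt to model.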

About us

The Urdu Phonetically Rich Speech Corpus is released by the Center for Speech and Language Technologies (CSaLT) at Information Technology University, Lahore.


Project Supervisor: Dr. Sarmad Hussain

Researcher: Agha Ali Raza

Team: Huda Sarfraz, Inaam Ullah, Zahid Sarfraz

Download Instructions

The corpus is available online here as a RAR file.


Copyright (c) by Agha Ali Raza, Information Technology University of the Punjab, Lahore, Pakistan. Your use of the Urdu Phonetically Rich Speech Corpus is subject to our Creative Commons License, which lets you distribute, remix, tweak, and build upon our work, even commercially, as long as you credit us for the original creation. You are required to cite the following two publications: