In a previous post I already talked briefly about Julius (continuous speech recognition). Today, however, we are going to get to know it a bit better. Keep in mind that Julius is installed on OpenQbo, so you can carry out your own tests.
Started at Kyoto University in Japan in 1991, Julius is a real-time speech recognition engine for continuous dictation based on hidden Markov models. Julius adopts its acoustic models and pronunciation dictionaries from the HTK toolkit which, unlike Julius, is not open source, although it can be freely downloaded and used to generate the acoustic models.
Julius and Qbo’s grammar (in real time) running on Qbo
We have said before that Julius lets us recognize speech either by continuous dictation or by using a previously defined grammar. But what does that mean?
“An acoustic model is a file which contains a statistical representation of each of the distinct sounds that make up a word (phonemes). Hidden Markov models are normally used for that statistical representation.”
Let’s explain this with a practical example:
ADVANCE [ADVANCE] ax d v ae n s
ADVANCED [ADVANCED] ax d v ae n s t
ADVANCEMENT [ADVANCEMENT] ax d v ae n s m ax n t
ADVANCES [ADVANCES] ax d v ae n s ax z
Above you can see how the words ADVANCE, ADVANCED, ADVANCEMENT and ADVANCES are split into phonemes. If we wanted to train Julius to recognize each of these words, we would record each word using a recording program and a microphone connected to the computer. Once we had the sound files, we would convert them into files containing a statistical representation of the words; for this purpose we can use, for example, the HTK toolkit. In this process, which looks simple but is not, we have to tell HTK and Julius how we split those words, so that each phoneme can later be identified separately with its weights adjusted optimally. Finally, the phonemes are joined back together and the word is identified by searching for the closest match.
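The point of splitting words into phonemes is that related words share acoustic material, so the recognizer only needs one statistical model per phoneme rather than one per whole word. A minimal Python sketch (the names and helper function are ours, not part of Julius or HTK) makes this sharing visible for the four words above:

```python
# A toy pronunciation dictionary, copied from the example above.
# Each word maps to its sequence of phonemes.
lexicon = {
    "ADVANCE":     ["ax", "d", "v", "ae", "n", "s"],
    "ADVANCED":    ["ax", "d", "v", "ae", "n", "s", "t"],
    "ADVANCEMENT": ["ax", "d", "v", "ae", "n", "s", "m", "ax", "n", "t"],
    "ADVANCES":    ["ax", "d", "v", "ae", "n", "s", "ax", "z"],
}

def common_prefix(pronunciations):
    """Longest phoneme prefix shared by all the given pronunciations."""
    prefix = []
    for phones in zip(*pronunciations):
        if len(set(phones)) != 1:  # phonemes disagree at this position
            break
        prefix.append(phones[0])
    return prefix

# All four words share the same six-phoneme stem, so the recognizer
# reuses the same phoneme models for all of them.
print(common_prefix(lexicon.values()))
# -> ['ax', 'd', 'v', 'ae', 'n', 's']
```

This is why an acoustic model trained on phonemes can recognize words it has heard in few recordings: the phoneme models are shared across the whole vocabulary.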
The problem with building an acoustic model good enough for continuous dictation is that it needs to be trained with many voice recordings. The more sound files containing different voices and texts, the higher the hit rate of the acoustic model. At the moment Julius does not have an English model good enough for continuous dictation. Out of this need the Voxforge project was born, led by Ken MacLean, which aims to create, with the help of all of us, an acoustic model acceptable for continuous dictation.
But how does Voxforge work? It is very simple and gratifying! From your browser, with a microphone, you can record your own voice reading some of the texts suggested by Voxforge, and once recorded, upload it automatically to Voxforge to be compiled into the acoustic model. You can do your bit by recording a few sentences now by clicking on this link.
While projects like Voxforge keep working toward a good enough acoustic model, we suggest two possibilities for our platform:
The first would be to provide an interface in Qbo’s chatterbot (we will talk about it in another post) through which the user could record, via the robot, the sentences he introduces in his conversations with Qbo and send them automatically to TheCorpora’s servers. In this way, our scripts could start building an acoustic model from real conversations held with the robot.
The second would be to design as wide a grammar as possible using Voxforge’s provisional acoustic model. We would use three files:
1. A list of the words we want Qbo to recognize (between square brackets) and their split into phonemes (to the right of the square brackets).
A real example of part of Qbo’s dictionary file:
0 [<s>] sil
1 [</s>] sil
2 [HOW] hh aw
2 [TO] t uw
3 [WHAT] w ah t
4 [I] ay
4 [YOU] y uw
4 [HE] hh iy
4 [SHE] sh iy
4 [IT] ih t
4 [WE] w iy
4 [THEY] dh ey
4 [YOUR] y ao r
5 [AM] ae m
5 [ARE] aa r
5 [IS] ih z
5 [MEET] m iy t
6 [GOOD] g uh d
7 [MORNING] m ao r n ix ng
7 [AFTERNOON] ae f t er n uw n
7 [EVENING] iy v n ix ng
7 [NIGHT] n ay t
7 [BYE] b ay
7 [NAME] n ey m
8 [HELLO] hh eh l ow
9 [YES] y eh s
10 [PLEASE] p l iy z
11 [NO] n ow
12 [THANKS] th ae ng k s
13 [FINE] f ay n
14 [WHO] hh uw
* [&lt;s&gt;] sil stands for the silence between words
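Each line of the dictionary file pairs a category number with a word and its phonemes, and words sharing a number belong to the same grammatical category. A small parser (our own sketch, not part of Julius) shows that structure:

```python
import re

def parse_dict(text):
    """Parse dictionary lines of the form
    '<category> [<WORD>] <phoneme> <phoneme> ...'
    into a mapping: category -> list of (word, phonemes)."""
    entries = {}
    for line in text.strip().splitlines():
        m = re.match(r"(\d+)\s+\[(\S+)\]\s+(.+)", line)
        if not m:
            continue
        cat = int(m.group(1))
        word = m.group(2)
        phones = m.group(3).split()
        entries.setdefault(cat, []).append((word, phones))
    return entries

# A fragment of the file shown above.
sample = """\
2 [HOW] hh aw
3 [WHAT] w ah t
4 [I] ay
4 [YOU] y uw
"""

# Category 4 groups the pronouns together.
print(parse_dict(sample)[4])
```

Grouping words by category is what lets the grammar rules in the third file refer to whole word classes (SUBJECT, VERB, and so on) instead of individual words.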
2. A list of the words divided into categories.
HOW hh aw
TO t uw
WHAT w ah t
YOU y uw
HE hh iy
SHE sh iy
IT ih t
WE w iy
THEY dh ey
YOUR y ao r
AM ae m
ARE aa r
IS ih z
MEET m iy t
3. A file which defines the order and construction of sentences.
S : NS_B QUESTION2 NS_E
S : NS_B SENTENCE2 NS_E
QUESTION2: ADVERB VERB SUBJECT
SENTENCE2: SUBJECT VERB
In this way, we first delimit each construction with silences and then tell Julius how we want it to build and recognize the sentences.
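To see what such a grammar accepts, we can expand the rules by hand. This sketch (our own simplification; Julius compiles the real files with mkdfa) enumerates the sentences generated by the two rules above, using a few words from the dictionary’s categories:

```python
from itertools import product

# Word categories, a small subset of the dictionary shown earlier.
categories = {
    "ADVERB":  ["HOW", "WHAT"],
    "SUBJECT": ["I", "YOU", "HE"],
    "VERB":    ["AM", "ARE", "IS"],
}

# Grammar rules: each sentence type is a fixed sequence of categories.
rules = {
    "QUESTION2": ["ADVERB", "VERB", "SUBJECT"],
    "SENTENCE2": ["SUBJECT", "VERB"],
}

def expand(rule):
    """Enumerate every word sequence the rule can produce."""
    pools = [categories[cat] for cat in rules[rule]]
    return [" ".join(words) for words in product(*pools)]

# SENTENCE2 yields 3 subjects x 3 verbs = 9 sentences ('I AM', 'YOU ARE', ...),
# and QUESTION2 yields 2 x 3 x 3 = 18.
print(expand("SENTENCE2"))
```

Note that the grammar happily generates ungrammatical combinations like "I ARE"; Julius only needs the rules to constrain the search space, not to be linguistically perfect.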
If you want to test Julius on OpenQbo, follow these instructions:
1. Connect a microphone to the computer.
2. Start the OpenQbo distro.
3. Go to MENU System | Administration | Synaptic Package Manager.
4. In Quick Search, type julius and press Enter, then select julius-voxforge and click Apply.
5. Go to the PLACES menu | Home Folder and create a new folder called “julius-sample”.
6. Open the julius-sample folder and copy all the files from PLACES | Computer | File System | usr | share | doc | julius-voxforge | examples into julius-sample.
7. Extract the julian.jconf.gz file.
8. Go to MENU Applications | Accessories | Terminal.
9. In the terminal, type cd julius-sample and press Enter.
10. Type mkdfa sample and press Enter to compile the sample files.
11. Type julius -input mic -C julian.jconf and press Enter.
12. Speak, for example “PHONE KEN”.
NOTE: The first 2-3 seconds of your speech will not be recognized while Julian adjusts its recognition levels (this is what the startup message about no CMN parameter being available refers to).