11-756 / 18799D Design and Implementation of ASR Systems

11-756/18799D ASR: Assignment 7, Training Phoneme Models

In this assignment we will train phoneme HMMs from the digit recordings used in Assignment 6 (both your own recordings and Aurora).

Problem 1: Train phoneme models from continuous-speech digit recordings. For Problem 1, use the small corpus of continuous digit recordings you recorded for Assignment 6.

Specifically, you will have to train models for the following phonemes. Model each phoneme using THREE emitting states:

AX, AH, AY, EH, EY, F, IH, IY, K, N, OW, R, S, T, TH, UW, V, W, Z

To do so, express each of the digits from zero through nine as phoneme sequences using the following dictionary:

ONE:    W AX N
TWO:    T UW
THREE:  TH R IY
FOUR:   F OW R
FIVE:   F AY V
SIX:    S IH K S
SEVEN:  S EH V EH N
EIGHT:  EY T
NINE:   N AY N
ZERO:   Z IY R OW

The training procedure is now no different from that of training word models from continuous recordings, except that you will now be training phoneme models.

Use the dictionary to represent your digit sequences as phoneme sequences, then train phoneme models with the same procedure you used to train digit models from continuous recordings. Model silences as before (i.e., add a silence model at the beginning and end of each recording, and at locations where you know there are pauses between words).

As an example, if you recorded the digit sequence 123456 as training data, and you have silences at the beginning and end of the recording, you would represent the digit sequence in the following manner to train your phoneme models:

SIL W AX N T UW TH R IY F OW R F AY V S IH K S SIL

If you know you actually also paused between 3 and 4 in the recording, you'd model it as:

SIL W AX N T UW TH R IY SIL F OW R F AY V S IH K S SIL
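
The expansion from a digit string to a phoneme transcript can be sketched in a few lines of Python. This is only an illustration: the names DICTIONARY and digits_to_phonemes and the pauses_after parameter are assumptions made for this sketch and are not part of any code handed out with the course. The pronunciations are copied from the dictionary above.

```python
# Pronunciations copied from the assignment's dictionary, keyed by digit character.
DICTIONARY = {
    "1": ["W", "AX", "N"],
    "2": ["T", "UW"],
    "3": ["TH", "R", "IY"],
    "4": ["F", "OW", "R"],
    "5": ["F", "AY", "V"],
    "6": ["S", "IH", "K", "S"],
    "7": ["S", "EH", "V", "EH", "N"],
    "8": ["EY", "T"],
    "9": ["N", "AY", "N"],
    "0": ["Z", "IY", "R", "OW"],
}


def digits_to_phonemes(digits, pauses_after=()):
    """Expand a digit string into a phoneme transcript with silences.

    pauses_after (hypothetical parameter) holds the 0-based positions of
    digits that are followed by a known pause in the recording.
    """
    seq = ["SIL"]  # silence at the start of the recording
    for i, digit in enumerate(digits):
        seq.extend(DICTIONARY[digit])
        # Insert SIL only for pauses between words; the final SIL is added below.
        if i in pauses_after and i != len(digits) - 1:
            seq.append("SIL")
    seq.append("SIL")  # silence at the end of the recording
    return seq


print(" ".join(digits_to_phonemes("123456")))
print(" ".join(digits_to_phonemes("123456", pauses_after={2})))
```

The second call marks a pause after the third digit (position 2), which corresponds to the pause between 3 and 4 in the example.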

For recognition, compose digit models from the phoneme models you have trained, using the dictionary above. Recognize the same digit sequences you recognized for Problem 1 of Assignment 6. Note: You will NOT be recognizing phoneme sequences. You will compose word models and recognize words!
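
Composing a word model amounts to concatenating the states of its phoneme models in dictionary order. The following is a minimal sketch under simplifying assumptions: each state is represented only by a (phoneme, state index) pair, and transition probabilities and emission densities are left out. The names compose_word_states and STATES_PER_PHONEME are hypothetical.

```python
STATES_PER_PHONEME = 3  # three emitting states per phoneme, as required above

# Word pronunciations copied from the assignment's dictionary.
DICTIONARY = {
    "ONE": ["W", "AX", "N"],
    "TWO": ["T", "UW"],
    "THREE": ["TH", "R", "IY"],
    "FOUR": ["F", "OW", "R"],
    "FIVE": ["F", "AY", "V"],
    "SIX": ["S", "IH", "K", "S"],
    "SEVEN": ["S", "EH", "V", "EH", "N"],
    "EIGHT": ["EY", "T"],
    "NINE": ["N", "AY", "N"],
    "ZERO": ["Z", "IY", "R", "OW"],
}


def compose_word_states(word):
    """Return the word's emitting states, in order, as (phoneme, state_index) pairs."""
    return [
        (phoneme, state)
        for phoneme in DICTIONARY[word]
        for state in range(STATES_PER_PHONEME)
    ]


six = compose_word_states("SIX")
print(len(six))          # 4 phonemes x 3 states
print(six[0] == six[9])  # first state of both S occurrences is the same state id
```

In this representation the two occurrences of S in SIX map to the same state identifiers, so both positions in the word model would draw on the parameters of the single trained S model.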

The one key new concept you will have to use now is initialization. In Assignment 6 you initialized word models from your isolated word recordings. We do not have isolated phoneme recordings, so that procedure cannot be used.

Instead we will use the following procedure, still using the original isolated word recordings as our training set for initialization:

1. Count the number of phonemes in each digit. Since we will be using isolated word recordings for initialization, we will assume that each word is bracketed by silence (has silence on either side). So, a recording of "ONE" will be counted as having 5 phonemes: SIL, W, AX, N, SIL.
2. Segment the recording into K*P segments, where K is the number of states per phoneme and P is the number of phonemes in the word. For instance, if you model phonemes with 3 states, you will have to segment a recording of ONE into 15 segments (5 phonemes, 3 states per phoneme).
3. Use this segmentation as your initial segmentation. The first 3 segments will represent the 3 states of the first phoneme, the next 3 segments the 3 states of the second phoneme, and so on. For instance, in an isolated recording of ONE, the first 3 segments will be the 3 states of SIL, the next 3 will be the 3 states of W, the next 3 will be the 3 states of AX, and so on.
4. Using the above initialization, train a complete set of phoneme models from your isolated word recordings. This procedure is exactly the same as training from continuous recordings, only you will be training phoneme models from isolated word recordings.
5. Use the final converged models as initialization for training phoneme models from continuous digit sequences.
6. You may want to first test these initial models by composing word models from them and recognizing your test set of isolated word recordings.
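
The initial segmentation described in the procedure above can be sketched as follows. This is a sketch under a stated assumption: the text only says to split the recording into K*P segments, and the code assumes the segments are contiguous and of (nearly) equal length. The function name uniform_segmentation and its parameters are hypothetical.

```python
def uniform_segmentation(num_frames, phonemes, states_per_phoneme=3):
    """Split num_frames feature frames into K*P nearly equal contiguous segments.

    phonemes is the phoneme sequence including the bracketing silences,
    e.g. ["SIL", "W", "AX", "N", "SIL"] for an isolated recording of ONE.
    Returns a list of (phoneme, state_index, start_frame, end_frame) tuples,
    with end_frame exclusive.
    """
    num_segments = len(phonemes) * states_per_phoneme
    if num_frames < num_segments:
        raise ValueError("recording has fewer frames than states")
    segments = []
    for k in range(num_segments):
        # Integer arithmetic keeps segments contiguous and covers every frame.
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        phoneme = phonemes[k // states_per_phoneme]
        state = k % states_per_phoneme
        segments.append((phoneme, state, start, end))
    return segments


# A 100-frame recording of ONE, bracketed by silence: 5 phonemes x 3 states.
for segment in uniform_segmentation(100, ["SIL", "W", "AX", "N", "SIL"]):
    print(segment)
```

The frames falling in each segment would then be pooled per (phoneme, state) pair across all recordings to estimate the initial state parameters, in the same way segment statistics were used for the word models. Note that SIL appears at both ends of every recording, so its three states collect frames from both the leading and trailing silence.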

Problem 2: Train phoneme models from the Aurora training data handed to you for Assignment 6. Use the models trained in Problem 1 as initialization. Compose word models from the trained phoneme models and use those to recognize the test set. Report performance.
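
The assignment does not say which performance measure to report. One common choice for digit-string recognition is word error rate, computed from the edit distance between the reference and recognized digit sequences. The sketch below assumes that choice; the function names edit_distance and word_error_rate are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word) # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference words."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


refs = [["ONE", "TWO", "THREE", "FOUR"]]
hyps = [["ONE", "TWO", "FOUR", "FOUR"]]
print(word_error_rate(refs, hyps))
```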

The Aurora data have an additional digit: "OH". For this, use the following dictionary entry:

OH:  OW

You will note that the phoneme required for this is already trained in Problem 1, so no additional work is needed to initialize this model.
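
A quick way to confirm that a new dictionary entry needs no new phoneme models is to compare its phonemes against the set trained in Problem 1. A minimal sketch, with the hypothetical names TRAINED_PHONEMES, AURORA_EXTRA, and missing_phonemes; the phoneme set is copied from the list given for Problem 1.

```python
# Phoneme set listed for Problem 1.
TRAINED_PHONEMES = {
    "AX", "AH", "AY", "EH", "EY", "F", "IH", "IY", "K", "N",
    "OW", "R", "S", "T", "TH", "UW", "V", "W", "Z",
}

# Additional dictionary entry for the Aurora data.
AURORA_EXTRA = {"OH": ["OW"]}


def missing_phonemes(dictionary, trained):
    """Return the phonemes used in dictionary that have no trained model."""
    used = {p for pronunciation in dictionary.values() for p in pronunciation}
    return used - trained


print(missing_phonemes(AURORA_EXTRA, TRAINED_PHONEMES))
```

An empty result means every phoneme in the new entry already has a model available for initialization.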

As before, use the loopy digit grammar to recognize the test set.

Due date: This is not a mandatory assignment. The goal behind completing this experiment is to demonstrate that you're smarter than everyone else. Also, if you've skipped earlier assignments, you can substitute one of them with this one.