Dirichlet Class Language Model for Speech Recognition
Jen-Tzung Chien and Chuang-Hua Chueh
Latent Dirichlet allocation (LDA) was successfully developed for document modeling because it generalizes to unseen documents through latent topic modeling. LDA calculates the probability of a document under the bag-of-words scheme, without considering the order of words. Accordingly, LDA cannot be directly adopted to predict words in speech recognition systems. This work presents a new Dirichlet class language model (DCLM), which projects the sequence of history words onto a latent class space and calculates a marginal likelihood over the uncertainties of classes, which are expressed by Dirichlet priors. A Bayesian class-based language model is established and a variational Bayesian procedure is presented for estimating DCLM parameters. Furthermore, the long-distance class information is continuously updated using large-span history words and is dynamically incorporated into class mixtures for a cache DCLM. The cache DCLM effectively characterizes unseen n-gram events and builds a class cache for long-distance language modeling. Different language models are experimentally evaluated using the Wall Street Journal (WSJ) corpus. DCLM and cache DCLM achieve a 3%-5% relative reduction in error rate over the LDA LM, although DCLM requires longer computation time for model training. The effects of the amount of training data and the vocabulary size are also evaluated.
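To make the marginal likelihood over classes more concrete, the following is a minimal C++11 sketch, not the code shipped in DCLM_probability.cpp. It assumes that the history words are encoded as a vector h, that a non-negative matrix A projects h linearly onto Dirichlet parameters alpha over the classes, and that beta[w][c] holds the probability of word w given class c. All of these names, and the linear form of the projection, are assumptions made for illustration. The only fact relied on is that a Dirichlet(alpha) distribution has mean alpha[c] / sum_j alpha[j], so integrating out the class weights of a mixture leaves a weighted sum of class-conditional word probabilities.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration; not taken from DCLM.rar.
using Matrix = std::vector<std::vector<double>>;

// Project the history encoding h onto the class space:
// alpha[c] = sum_k A[c][k] * h[k].
// The result is treated as the parameter vector of a Dirichlet
// distribution over class mixture weights (assumed linear projection).
std::vector<double> projectHistory(const Matrix& A, const std::vector<double>& h) {
    std::vector<double> alpha(A.size(), 0.0);
    for (std::size_t c = 0; c < A.size(); ++c) {
        assert(A[c].size() == h.size());
        for (std::size_t k = 0; k < h.size(); ++k) {
            alpha[c] += A[c][k] * h[k];
        }
    }
    return alpha;
}

// Marginal word probability with the class weights integrated out:
// p(w | h) = sum_c beta[w][c] * E[theta_c], theta ~ Dirichlet(alpha),
// where E[theta_c] = alpha[c] / sum_j alpha[j].
double classMixtureProbability(const Matrix& beta, std::size_t w,
                               const std::vector<double>& alpha) {
    double total = 0.0;
    for (double a : alpha) total += a;
    assert(total > 0.0);  // requires at least one positive Dirichlet parameter

    double p = 0.0;
    for (std::size_t c = 0; c < alpha.size(); ++c) {
        p += beta[w][c] * alpha[c] / total;
    }
    return p;
}
```

For example, with two classes, alpha = (1, 3) and beta rows (0.5, 0.1) and (0.5, 0.9), the two word probabilities are 0.2 and 0.8, which sum to one because each column of beta is itself a distribution over words.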
Download the readme.txt.
Download the DCLM source code: DCLM.rar
This is a C implementation of variational EM for Dirichlet class language model (DCLM) estimation. Functions that calculate the DCLM and cache DCLM probabilities are also provided.
The released DCLM.rar contains:
1. DCLM.cpp: full Visual C++ source code for variational Bayesian EM (VB-EM) estimation of the DCLM.
2. DCLM_probability.cpp: source code of the function that calculates the n-gram probability using DCLM or cache DCLM.
3. LoadModel.cpp: source code of the function that loads the estimated DCLM model.
4. Parser.cpp and Parser.h: source code that splits a line read from a file into tokens at blanks.
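The kind of line segmentation described for Parser.cpp can be sketched as follows. This is an illustrative stand-in, not the interface declared in Parser.h; the function name splitOnBlanks is hypothetical. It relies on stream extraction, which treats any run of whitespace (spaces or tabs) as a single separator.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not the actual Parser.h interface.
// Splits one line of text into tokens separated by whitespace.
std::vector<std::string> splitOnBlanks(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> tokens;
    std::string token;
    // operator>> skips leading whitespace and reads up to the next blank,
    // so consecutive blanks never produce empty tokens.
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling splitOnBlanks on a line such as "the  stock market" (with a doubled space) yields the three tokens "the", "stock" and "market".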
The code was developed on Windows XP Professional x64 Edition SP2, on a computer with a 3 GHz Intel Core 2 Duo CPU and 8 GB of RAM.
The original code was implemented using Microsoft Visual C++ 6.0. To meet the large memory requirement, the source code was automatically converted to Microsoft Visual C++ 2008, which supports 64-bit memory addressing.
PS. Converting the code may introduce compile errors; for example, some variable declarations may be lost. Please add the declarations of these variables manually.
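The notes above do not say which declarations go missing. One plausible cause, assumed here rather than confirmed for this code, is loop-variable scoping: Visual C++ 6.0 let a variable declared in a for-loop header remain visible after the loop, whereas standard C++ ends its scope with the loop, so later uses are reported as undeclared identifiers. The sketch below uses a hypothetical function (not taken from DCLM.rar) to show the manual fix of declaring the variable before the loop.

```cpp
// Hypothetical example; not taken from DCLM.rar.
//
// Style accepted by Visual C++ 6.0:
//     for (int i = 0; i < n; ++i) { if (values[i] > threshold) break; }
//     return i;   // 'i' is an undeclared identifier under standard scoping
//
// Fix: declare the loop variable before the loop so it stays in scope.
int firstIndexAbove(const double* values, int n, double threshold) {
    int i;  // manually added declaration, visible after the loop
    for (i = 0; i < n; ++i) {
        if (values[i] > threshold) {
            break;
        }
    }
    return i;  // equals n when no element exceeds the threshold
}
```

If the compile errors in a converted project have a different cause, the same principle applies: locate the first use that the compiler reports as undeclared and add the declaration in the enclosing scope.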