Topic Models for Morphologically Rich Languages and Their Usage to Explore Multilingual Corpora

Topic models are statistical models that capture co-occurrence patterns of words across documents. Inspection of topic models learned over large textual corpora shows that they capture thematically coherent word distributions. In recent years, topic models have emerged as a useful method for inferring practical semantic features for a variety of applications. In this talk, we present a variant of the Latent Dirichlet Allocation (LDA) method for morphologically rich languages and discuss its possible usage for multilingual corpus analysis.

LDA is a generative model: it assumes that documents are generated from a distribution over topics, and that topics are, in turn, distributions over the vocabulary. Statistical inference discovers the topics that best explain a corpus. Traditionally, LDA operates over a token-based representation of documents, where tokens are sometimes filtered to remove frequent words or to keep only nouns. We show that such a token-based document representation is inadequate when applied to a Hebrew corpus: because of the large number of morphological variants, it fails to capture coherent word patterns. We then present a lemma-based model that infers an additional layer between documents and tokens. This model successfully captures rich topics on a variety of Hebrew corpora in the legal and medical domains.

Finally, we survey recent techniques for inferring multilingual topic models over multilingual corpora, for both aligned and unaligned texts. We also discuss a setting where documents in Hebrew are annotated using an English medical ontology, and show how the Hebrew topic models can be aligned with terms of the English ontology, thus deriving a useful resource for further machine translation processing.

This is joint work with Meni Adler, Yoav Goldberg and Rafi Cohen.
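
For reference, the standard LDA generative story (Blei, Ng and Jordan, 2003) in conventional notation; this is textbook background rather than material specific to the talk:

    \begin{align*}
    \phi_k   &\sim \mathrm{Dirichlet}(\beta)                 && \text{topic $k$: a distribution over the vocabulary} \\
    \theta_d &\sim \mathrm{Dirichlet}(\alpha)                && \text{document $d$: a distribution over topics} \\
    z_{d,n}  &\sim \mathrm{Categorical}(\theta_d)            && \text{topic assignment of the $n$-th token of $d$} \\
    w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})      && \text{the observed token}
    \end{align*}

Inference then recovers the per-document topic proportions \theta_d and the per-topic word distributions \phi_k that best explain the observed tokens.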
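
The abstract does not spell out how the additional lemma layer is parameterized, so the following is purely an assumption on our part: one common way to formalize such a layer is to let each topic emit lemmas, and each lemma emit its surface (inflected) forms:

    \begin{align*}
    z_{d,n}    &\sim \mathrm{Categorical}(\theta_d)              && \text{topic assignment, as in plain LDA} \\
    \ell_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})        && \text{latent lemma drawn from the topic} \\
    w_{d,n}    &\sim \mathrm{Categorical}(\psi_{\ell_{d,n}})     && \text{observed surface form emitted by the lemma}
    \end{align*}

Under this reading, topics are distributions over lemmas rather than surface tokens, which is what allows the many morphological variants of a Hebrew word to contribute to a single coherent topic. The talk's exact model may differ.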
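
As a minimal illustrative sketch only: the contrast between token-based and lemma-based representations can be reproduced with off-the-shelf LDA by lemmatizing deterministically before training. This is a simplification of the approach described in the talk, which infers the lemma layer jointly rather than fixing it in preprocessing, and the lemmatize() helper below is a toy stand-in for a real Hebrew morphological analyzer:

    from gensim import corpora, models

    # Toy lemma table; a real pipeline would use a morphological analyzer.
    # In Hebrew the variant explosion is far larger (prefixes, suffixes,
    # inflection), which is what breaks the token-based representation.
    LEMMA_TABLE = {
        "walked": "walk", "walking": "walk", "walks": "walk",
        "laws": "law", "lawyers": "lawyer",
    }

    def lemmatize(token):
        # Hypothetical helper: map a surface form to its lemma.
        return LEMMA_TABLE.get(token, token)

    def train_lda(docs, num_topics=2, use_lemmas=False):
        # Build either a token-based or a lemma-based representation,
        # then train standard LDA over the resulting bags of words.
        texts = [[lemmatize(t) if use_lemmas else t
                  for t in doc.lower().split()]
                 for doc in docs]
        dictionary = corpora.Dictionary(texts)
        bows = [dictionary.doc2bow(text) for text in texts]
        return models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)

    docs = ["The lawyer walked to court", "Lawyers walking near courts"]
    model = train_lda(docs, use_lemmas=True)
    print(model.print_topics())

With use_lemmas=True, "walked", "walking" and "walks" collapse to one vocabulary entry, so their counts reinforce a single topic dimension instead of being spread across three.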