Archive for the ‘comp.linguistics’ tag
Automatic Language Detection using Scala
For a personal project of mine I want to be able to identify the input text’s language. Although this feature is just a nice-to-have, rather than a must-have, its addition would make the results produced much more meaningful. So let’s see if we can hack out a prototype
I know that n-gram models have something to do with language detection so let’s dig around on wikipedia.
The basic idea with using n-gram models for language detection is follows:
- Generate n-gram profiles for supported languages (train on corresponding corpa).
- Generate n-gram profile for text to identify.
- Compare results from # 2 to #3. If the profiles match, so do the languages.
So what is a language’s n-gram profile? It’s just the frequency distribution of n-(character|word) sequence for that language. This tells us how n-grams are chained in a given language and this data is generated using a corpa of text from the language.
So how does comparison work? If n-grams are chained the same between the unknown text and any other language, we know there is a match.
Does this make sense intuitively? Yup. Languages sound different because of the way the consonants and vowels are chained together. And, what we are doing is creating a sort of a fingerprint of how a given language is written.
How well does this work? We’ll see.
Since I have done the last few posts in Erlang, I am going to try to do this one using Scala.
import collection.mutable.HashMap class DeBabel { val n : Int = 4 var fd : HashMap[String, CountMap[String]] = new HashMap[String, CountMap[String]]() def train(pathToCorpus: String, languageId: String) = { fd(languageId) = new CountMap(getNgrams(scala.io.Source.fromFile(pathToCorpus).mkString)) } private def getNgrams(text: String) : Iterable[String] = { //remove non-word chars and trim whitespace. val normalizedText : String = text.replaceAll("\\W+"," ").replaceAll("\\s", " ") (0 to normalizedText.length - n).map(i => normalizedText.substring(i,i+n)) } //compare the given text to all languages we can identify def compareToLanguages(text: String) : Iterable[Tuple2[String, Double]] ={ val ngrams = new CountMap(getNgrams(text)) fd.map( kv => (kv._1, getSimilarityIndex(ngrams,kv._2)) ) } // get similarity between 2 frequency tables as a value between 0 to 1 def getSimilarityIndex(map1: CountMap[String], map2: CountMap[String]) : Double ={ var total = 0.0 var diff = 0.0 map1.foreach{ kv => diff += Math.abs(kv._2 - map2(kv._1)) total += Math.max(kv._2, map2(kv._1)) } (total-diff)/total } } // from http://blog.lostlake.org/index.php?/archives/61-A-spell-checker-in-Scala.html class CountMap[K](start: Iterable[K]) extends HashMap[K, Int] { override def default(key: K) = 0 start.foreach(v => this(v) = this(v) + 1) }
And a simple driver:
object Driver { def main(args:Array[String]) = { var detector = new DeBabel(); detector.train("click10.txt","english") detector.train("19380.txt","german") detector.train("720kc10.txt","french") println(detector.compareToLanguages(scala.io.Source.fromFile("25019.txt").mkString) .toList.sort(_._2 > _._2)) } }
Note: Texts used were Jules Verne and P.G Wodehouse novels from Project Gutenberg
$ scalac mrd.scala $ List((german,0.2715568746868134), (english,0.1184407499488911), (french,0.05424252275682705))
And the text was german so it works. The margins could use a little work, but given the sparse training data I am happy with the results at this stage.
What next ?
I only tested against Romance languages. And it would probably help to train with larger corpa
I could also have tested to see whether character or word grams give better results(intuitively, it seems that word level n-grams would perform better). I also ignored char-set information.
However, in spite of these limitations, the results were more than satisfactory for the amount of work I put in. Even by training it on just one book, this small program was able to correctly identify the language used. If we tune this using additional heuristics (for example a Language Recognition Chart) we should be able to make the results much better.
Although I haven’t verified this, I feel like this method will work well for languages with a “phonetic” script (that may be the wrong term, what I mean is graphemes represent consonants and vowels in that language). However, languages like chinese/japanese where a graphemes represent whole words, going may be tough. Might this be mitigated using word level n-grams rather than character level ones? Maybe, I don’t know enough (any) Chinese/japanese/Korean to be able to test that.
Disclaimer: As you are probably painfully aware by now, I am not even close to being a computational linguist. I haven’t even sat next to one on a long flight. As such, I apologize for anything in this post that was completely off mark. Feel free to correct me. As this was my first Scala program, the same thing applies there too ![]()