Everything's Beta

things I don't get to do at work :)

Archive for the ‘comp.linguistics’ tag

Automatic Language Detection using Scala

without comments

For a personal project of mine I want to be able to identify the input text’s language. Although this feature is just a nice-to-have, rather than a must-have, its addition would make the results produced much more meaningful. So let’s see if we can hack out a prototype

I know that n-gram models have something to do with language detection so let’s dig around on wikipedia.

The basic idea with using n-gram models for language detection is follows:

  1. Generate n-gram profiles for supported languages (train on corresponding corpa).
  2. Generate n-gram profile for text to identify.
  3. Compare results from # 2 to #3. If the profiles match, so do the languages.

So what is a language’s n-gram profile? It’s just the frequency distribution of n-(character|word) sequence for that language. This tells us how n-grams are chained in a given language and this data is generated using a corpa of text from the language.

So how does comparison work? If n-grams are chained the same between the unknown text and any other language, we know there is a match.

Does this make sense intuitively? Yup. Languages sound different because of the way the consonants and vowels are chained together. And, what we are doing is creating a sort of a fingerprint of how a given language is written.

How well does this work? We’ll see.

Since I have done the last few posts in Erlang, I am going to try to do this one using Scala.

import collection.mutable.HashMap
 
class DeBabel {
	val n : Int = 4
 	var fd : HashMap[String, CountMap[String]] =  new HashMap[String, CountMap[String]]()
 
	def train(pathToCorpus: String, languageId: String) = {
	  fd(languageId) =  new CountMap(getNgrams(scala.io.Source.fromFile(pathToCorpus).mkString))
	}
 
	private def getNgrams(text: String) : Iterable[String] = {
	  //remove non-word chars and trim whitespace.
	  val normalizedText : String = text.replaceAll("\\W+"," ").replaceAll("\\s", " ")
	  (0 to normalizedText.length - n).map(i => normalizedText.substring(i,i+n))
	}
 
	//compare the given text to all languages we can identify
	def compareToLanguages(text: String) : Iterable[Tuple2[String, Double]]  ={
		val ngrams = new CountMap(getNgrams(text))
		fd.map( kv => (kv._1, getSimilarityIndex(ngrams,kv._2)) )
	}
 
	// get similarity between 2 frequency tables as a value between 0 to 1
	def getSimilarityIndex(map1: CountMap[String], map2: CountMap[String]) : Double ={
		var total = 0.0
		var diff = 0.0
 
		map1.foreach{ kv =>
			diff += Math.abs(kv._2 - map2(kv._1))
			total += Math.max(kv._2, map2(kv._1))
		}
		(total-diff)/total	
	}
}
// from http://blog.lostlake.org/index.php?/archives/61-A-spell-checker-in-Scala.html
class CountMap[K](start: Iterable[K]) extends HashMap[K, Int] {
	override def default(key: K) = 0
	start.foreach(v => this(v) = this(v) + 1)
 }

And a simple driver:

object Driver {
	def main(args:Array[String]) = {
		var detector = new DeBabel();
		detector.train("click10.txt","english")
		detector.train("19380.txt","german")
		detector.train("720kc10.txt","french")
		println(detector.compareToLanguages(scala.io.Source.fromFile("25019.txt").mkString)
		          .toList.sort(_._2 > _._2))
	}
}

Note: Texts used were Jules Verne and P.G Wodehouse novels from Project Gutenberg

$ scalac mrd.scala 
$ List((german,0.2715568746868134), (english,0.1184407499488911), (french,0.05424252275682705))

And the text was german so it works. The margins could use a little work, but given the sparse training data I am happy with the results at this stage.

What next ?

I only tested against Romance languages. And it would probably help to train with larger corpa :) I could also have tested to see whether character or word grams give better results(intuitively, it seems that word level n-grams would perform better). I also ignored char-set information.

However, in spite of these limitations, the results were more than satisfactory for the amount of work I put in. Even by training it on just one book, this small program was able to correctly identify the language used. If we tune this using additional heuristics (for example a Language Recognition Chart) we should be able to make the results much better.

Although I haven’t verified this, I feel like this method will work well for languages with a “phonetic” script (that may be the wrong term, what I mean is graphemes represent consonants and vowels in that language). However, languages like chinese/japanese where a graphemes represent whole words, going may be tough. Might this be mitigated using word level n-grams rather than character level ones? Maybe, I don’t know enough (any) Chinese/japanese/Korean to be able to test that.

Disclaimer: As you are probably painfully aware by now, I am not even close to being a computational linguist. I haven’t even sat next to one on a long flight. As such, I apologize for anything in this post that was completely off mark. Feel free to correct me. As this was my first Scala program, the same thing applies there too :)

Written by srijak

August 30th, 2009 at 2:41 am

Posted in code

Tagged with ,