Everything's Beta

things I don't get to do at work :)

Quiz time

without comments

Here’s a quick and useless quiz to test how many language names you can type out :)

http://www.sporcle.com/games/psychofish25/hello_world_syntax

q

Yeah, thats a 22/22 you see :P

Written by srijak

December 21st, 2009 at 9:34 pm

Posted in Uncategorized

Ejabberd, exmpp install on Snow Leopard:

with 2 comments

Installing exmpp wasn’t as easy as ./configure; make; sudo make install :(
Regardless of what release I used, I repeatedly got the following non-descript error:

 1> exmpp:start().
 
=INFO REPORT==== 15-Nov-2009::19:44:27 ===
    application: exmpp
    exited: {shutdown,{exmpp,start,[normal,[]]}}
    type: temporary
{error,{shutdown,{exmpp,start,[normal,[]]}}}

The problem turned out to be erl is 32bits on Snow Leopard. So:

1. Compile both ejabberd and exmpp from source like this:
   CC='gcc -m32' CFLAGS=-m32 LDFLAGS=-m32 ./configure 
   make
   sudo make install

The fix was simple enough, but required some fumbling around to figure out the problem. Hope it helps someone else :)

Written by srijak

November 15th, 2009 at 8:31 pm

Posted in Uncategorized

Jiggly goodness

without comments

This is barely worth the effort of making a post for. However, since it took me more than a couple doc. references to figure this out, I figured it will help someone else.

Situation:

I have a text field inside a custom UITableViewCell. When a user tries to edit the text when it is disabled, I want to make it obvious. So, I decided to have the whole cell jiggle back and forth prettily when this happens.

Fix:

Add jiggle animation to the pertinent layer when the cell is selected.

- (void)setSelected:(BOOL)selected animated:(BOOL)animated {
    [super setSelected:selected animated:animated];
    if (!textField.enabled && !first){
        [self jiggle:self.layer]; // send in textField.layer if you just want the text to jiggle
    }
    first = false;  //hack: setSelected is called when the table is loaded...
}
 
-(void) jiggle:(CALayer*) layer{
 
    CABasicAnimation *animation = [CABasicAnimation animationWithKeyPath:@"position"];
    [animation setFromValue:[NSValue valueWithCGPoint:CGPointMake(layer.position.x-2, layer.position.y)]];
    [animation setToValue:[NSValue valueWithCGPoint:CGPointMake(layer.position.x+2, layer.position.y)]];
    [animation setDuration:0.1f];
    animation.repeatCount = 3;
    animation.autoreverses = YES;
 
    [layer addAnimation:animation forKey:@"positionAnimation"];
 
}

If you want to know more about CABasicAnimation, start here.

Written by srijak

October 26th, 2009 at 12:30 am

Posted in code

Tagged with ,

Automatic Language Detection using Scala

without comments

For a personal project of mine I want to be able to identify the input text’s language. Although this feature is just a nice-to-have, rather than a must-have, its addition would make the results produced much more meaningful. So let’s see if we can hack out a prototype

I know that n-gram models have something to do with language detection so let’s dig around on wikipedia.

The basic idea with using n-gram models for language detection is follows:

  1. Generate n-gram profiles for supported languages (train on corresponding corpa).
  2. Generate n-gram profile for text to identify.
  3. Compare results from # 2 to #3. If the profiles match, so do the languages.

So what is a language’s n-gram profile? It’s just the frequency distribution of n-(character|word) sequence for that language. This tells us how n-grams are chained in a given language and this data is generated using a corpa of text from the language.

So how does comparison work? If n-grams are chained the same between the unknown text and any other language, we know there is a match.

Does this make sense intuitively? Yup. Languages sound different because of the way the consonants and vowels are chained together. And, what we are doing is creating a sort of a fingerprint of how a given language is written.

How well does this work? We’ll see.

Since I have done the last few posts in Erlang, I am going to try to do this one using Scala.

import collection.mutable.HashMap
 
class DeBabel {
	val n : Int = 4
 	var fd : HashMap[String, CountMap[String]] =  new HashMap[String, CountMap[String]]()
 
	def train(pathToCorpus: String, languageId: String) = {
	  fd(languageId) =  new CountMap(getNgrams(scala.io.Source.fromFile(pathToCorpus).mkString))
	}
 
	private def getNgrams(text: String) : Iterable[String] = {
	  //remove non-word chars and trim whitespace.
	  val normalizedText : String = text.replaceAll("\\W+"," ").replaceAll("\\s", " ")
	  (0 to normalizedText.length - n).map(i => normalizedText.substring(i,i+n))
	}
 
	//compare the given text to all languages we can identify
	def compareToLanguages(text: String) : Iterable[Tuple2[String, Double]]  ={
		val ngrams = new CountMap(getNgrams(text))
		fd.map( kv => (kv._1, getSimilarityIndex(ngrams,kv._2)) )
	}
 
	// get similarity between 2 frequency tables as a value between 0 to 1
	def getSimilarityIndex(map1: CountMap[String], map2: CountMap[String]) : Double ={
		var total = 0.0
		var diff = 0.0
 
		map1.foreach{ kv =>
			diff += Math.abs(kv._2 - map2(kv._1))
			total += Math.max(kv._2, map2(kv._1))
		}
		(total-diff)/total	
	}
}
// from http://blog.lostlake.org/index.php?/archives/61-A-spell-checker-in-Scala.html
class CountMap[K](start: Iterable[K]) extends HashMap[K, Int] {
	override def default(key: K) = 0
	start.foreach(v => this(v) = this(v) + 1)
 }

And a simple driver:

object Driver {
	def main(args:Array[String]) = {
		var detector = new DeBabel();
		detector.train("click10.txt","english")
		detector.train("19380.txt","german")
		detector.train("720kc10.txt","french")
		println(detector.compareToLanguages(scala.io.Source.fromFile("25019.txt").mkString)
		          .toList.sort(_._2 > _._2))
	}
}

Note: Texts used were Jules Verne and P.G Wodehouse novels from Project Gutenberg

$ scalac mrd.scala 
$ List((german,0.2715568746868134), (english,0.1184407499488911), (french,0.05424252275682705))

And the text was german so it works. The margins could use a little work, but given the sparse training data I am happy with the results at this stage.

What next ?

I only tested against Romance languages. And it would probably help to train with larger corpa :) I could also have tested to see whether character or word grams give better results(intuitively, it seems that word level n-grams would perform better). I also ignored char-set information.

However, in spite of these limitations, the results were more than satisfactory for the amount of work I put in. Even by training it on just one book, this small program was able to correctly identify the language used. If we tune this using additional heuristics (for example a Language Recognition Chart) we should be able to make the results much better.

Although I haven’t verified this, I feel like this method will work well for languages with a “phonetic” script (that may be the wrong term, what I mean is graphemes represent consonants and vowels in that language). However, languages like chinese/japanese where a graphemes represent whole words, going may be tough. Might this be mitigated using word level n-grams rather than character level ones? Maybe, I don’t know enough (any) Chinese/japanese/Korean to be able to test that.

Disclaimer: As you are probably painfully aware by now, I am not even close to being a computational linguist. I haven’t even sat next to one on a long flight. As such, I apologize for anything in this post that was completely off mark. Feel free to correct me. As this was my first Scala program, the same thing applies there too :)

Written by srijak

August 30th, 2009 at 2:41 am

Posted in code

Tagged with ,

Heapsort Pt 1

without comments

I happened to see one of my colleague’s passwords, “h3apS0r7″. As I asked him whether he actually knew how to implement it, I realized that I had never had to implement a Heapsort. And as we all (should) know: knowing how to do X and actually doing X are vastly different things.

So this post is my attempt at fixing that.

What’s Heapsort anyway? Like the name suggests it consists of

  1. Building a (max/min)heap using the data
  2. Plucking out the root(which is either the greatest or smallest value depending on whether you built a max or min heap) continuously such that you have the data in order.

Remind you of something? If you read just #2 you will see that Heapsort is just Selection sort powered by a heap :)

Let’s start off by implementing Selection sort. First we need a function pop_max(List) that returns/removes the maximum element from a list:

-module(heapsort).
-export([pop_max/1]).
 
% pop_max(List) returns {MaxValue, RestOfList}
pop_max([])->{nil,[]};
pop_max(L) -> 
     Max = lists:max(L),
     {Max,lists:delete(Max,L)}.

Integrating this into Selection sort is pretty easy too:

selectionSort([])-> [];
selectionSort(L) ->
    {Max,Rest} = pop_max(L),
    selectionSort(Rest) ++ [Max].
 
% Note: Yeah it's not tail recursive, but I think it looks cleaner.

Let’s try it out:

8> c(heapsort).                    
{ok,heapsort}
9> heapsort:selectionSort([1,5,2,77,8,3,2,5,8,12,53,1,5,87,1,4573,2,969]).
[1,1,1,2,2,2,3,5,5,5,8,8,12,53,77,87,969,4573]

Looks good to me.

Anyway, ideally we should be able to just swap out pop_max with pop_heap.

Let’s figure out how we want to implement the binary heap now. We need to support adding and removing the root from the heap.
Adding to the heap works as follows:

  1. Add the item at the bottom.
  2. Swap the item with it’s parent until the parent is greater than the item (a.k.a sift-up)

Removing from the heap is the inverse .i.e:

  1. Pop off the root and set the last item on the lowest level as the root.
  2. Swap the root with it’s children until the item is greater than it’s children(a.k.a sift-down).

Something else we want to determine is the underlining data structure for the Heap. We could use the binary tree we developed earlier, or we could use a list. Let’s go with the latter.

Anyway, I set myself a time limit of 20 minutes when I started this and it’s kinda late, so I’m going to implement the binary Heap in the next post :)

Written by srijak

August 22nd, 2009 at 10:34 am

Posted in code

Tagged with

Me Likey

without comments

One of, if not *the*, book(s) in computational linguistics is Foundations of Statistical Natural Language Processing by Christopher Mannin and Hinrich Schuetze. I have been working my way (slowly) through this book for a personal project and feel its value to the point where I am seriously considering buying it (current copy is from the library).

Anyway, I went over to Amazon and decided to look see what other books the authors had out which lead me to Intro. to Information Retrieval

Seems like an interesting book right? But, with books like these being pretty expensive, its hard to justify buying one out right. If I did that with all the books I felt like checking out I would be much (much) poorer. So, like we all do, I fired up google to see if I could find a google books version. I found something even better: the book’s site containing a wealth of online resources including free PDF copies. Perfect.

Even as increasing number of books (SICP, Real World Haskell etcs) have an online version for free(*legally*), I think it’s laudable that the authors are doing this.

I like to buy books that are worth keeping for years so I really appreciate being able to vet what I spend my money and shelf space on.

Note to publishers: providing a free online version of a book makes it more likely that I will buy the dead tree version.The catch is the book cant be a throwaway/one time read so YMMV.

Written by srijak

August 20th, 2009 at 10:31 pm

Posted in no-code

Tagged with

Shared + Concurrent BST

without comments

As promised, today we will try to re-implement the BST from last post and make it erlang-y. Once again, I am going to just implement inserts and leave the rest as as exercise for the reader. I am going to try to put down my thought process as I try to solve this.

To recap here’s what the BST insert looks like :

-record(node,{kvTuple,left,right}).
 
binsert(#node{kvTuple={CK,CV},left=L,right=R},{K,V}) when K < CK ->
    #node{kvTuple={CK,CV},left=binsert(L,{K,V}),right=R};
binsert(#node{kvTuple={CK,CV},left=L,right=R},{K,V}) ->
    #node{kvTuple={CK,CV},left=L,right=binsert(R,{K,V})};
binsert(null,{K,V}) ->
    #node{kvTuple={K,V},left=null,right=null}.

My first instinct is to give this process a “mailbox” so that we have a shared BST that can be updated by multiple processes concurrently. However, this will process the messages sequentially. To mitigate this we can spawn a process to handle each message and so parallelize it. Is this the best solution though ? As we are going utilizing parallelization, can we break this up into smaller/more processes?

What about if each node is a process? Would that work? Would this incur too much overhead ? Hopefully not. Erlang’s supposed to have really lightweight processes so we are going to assume it’s ok (maybe a future post about exactly what the overhead is).

So at this point I did a quick google search on concurrent BSTs to see if there were other ways I was missing and I ran into this post by Faud AlTabba which totally messed up this post :)
His post has an implementation for each of the ideas I talked about and covers everything I hoped too. So go read it. I’m going to go try to find something else to practice Erlang on.

Written by srijak

July 29th, 2009 at 8:59 pm

Posted in code

Tagged with

Quick and dirty erlang BST

without comments

The book I have been reading implements a Set using a BST. Implementing a naive BST
in Erlang proved much easier than I expected. Maybe thats because I just implemented inserts :)

I started with a function that does sorted inserts on lists :

insert([],V) -> V;
insert([H|T],V)  when V > H ->
    [H|insert(T,V)];
insert([H|T],V)  when V < H ->
    [insert(H,V)|T];
insert(H,V) ->
    [V|H].

Once I defined a node record, converting to a bst was pretty easy:

%% what our node looks like
-record(node,{kvTuple,left,right}).
 
binsert(#node{kvTuple={CK,CV},left=L,right=R},{K,V}) when K < CK ->
    #node{kvTuple={CK,CV},left=binsert(L,{K,V}),right=R};
binsert(#node{kvTuple={CK,CV},left=L,right=R},{K,V}) ->
    #node{kvTuple={CK,CV},left=L,right=binsert(R,{K,V})};
binsert(null,{K,V}) ->
    #node{kvTuple={K,V},left=null,right=null}.

Adding a “memberOf” function is pretty much exactly like the insert so I wont bother showing it here.

Anyway, since that was trivial to implement, lets take it a step further: Erlang’s strength is concurrency so next post will be about adding concurrency to this BST.

Written by srijak

July 26th, 2009 at 8:58 pm

Posted in code

Tagged with

A little more Erlang

without comments

I have been meaning to read Purely Functional Data Structures for a while. So, I grabbed a copy earlier and started to work through the book using Erlang.

Translating1 the first couple of examples to Erlang was easy and looked pretty much exactly like the ML implementations:

%concatenate 2 lists
concat([],X) -> X;
concat(Y,[]) -> Y;
concat([H|T],X) -> [H|concat(T,X)].
 
%update value at given index
update([_H|T],0,N) -> [N|T];
update([H|T],C,N) -> [H|update(T,C-1,N)].



The first exercise (2.1) was to write a function suffixes that takes a list xs and returns a list of all suffixes in decreasing order of length.

%get all suffixes for the list
suffixes([]) -> [];
suffixes([H|T]) -> [[H|T]|suffixes(T)].

These are generated in O(n) time and represented in O(n) space.


Next up implementing a Binary Search Tree…



1.Translating: Not strictly translating. I write my own from the description, then match it with the author’s ML implementation to make sure they are close enough.

Written by srijak

July 26th, 2009 at 5:24 pm

Posted in code

Tagged with

HTTP PUT Rails gotcha

without comments

So, I was pretty surprised to learn that Rails PUT requests are required to be  “application/x-www-form-urlencoded”. Definitely not expected behavior when I am just consuming someone else’s API. If that isn’t completely true, feel free to enlighten me.

Basically here’s what happened: I was doing a HTTP PUT using C# to an API on Rails. C# looked pretty much like this:

request = (HttpWebRequest) WebRequest.Create(uri);
request.Method = "PUT";
request.ContentType = "text/plain";
request.ContentLength = postData.Length;
using(var writeStream = request.GetRequestStream())
{
      var bytes = (new UTF8Encoding()).GetBytes(postData);
      writeStream.Write(bytes, 0, bytes.Length);
}

But the code no worky.

After trying a couple of different things, I talked to the people who implemented the API. They said the Rails log indicated that I wasn’t PUTing anything .i.e. they could see the PUT request, but not the data. So, my first instinct was that it must be the content type, but was told it very definitely couldn’t be that.

Mystified, I wrote a quick Ruby script to do the same request. It worked!

Out came Wireshark.

Guess what the differences were ?

  1. C# has keep-alive set, Ruby didn’t.
  2. Ruby content-type was “application/x-www-form-urlencoded”.

I figured 1 really wasn’t the problem and since 2 supported my gut reaction, I went with it.
Sure enough, once I set content-type to “application/x-www-form-urlencoded” the c# code worky.

Moral of story:

  • Remember Rails requires a PUT to be of Content-type:”application/x-www-form-urlencoded”. Maybe it takes other things, but it doesn’t take text/plain.
  • Be very aware how the different layers/libraries interact in your app. All that magic is a double edged sword. There are 3rd party libs out there that do everything. But when you use one, please be sure you know what it does :)
  • When debugging take all assumptions (esp. other people’s :) ) with a grain of salt.

Written by srijak

July 22nd, 2009 at 7:12 pm

Posted in code

Tagged with , , , ,