It is all about me!
you have nothing to prove, except theorems


 

August 21, 2010

Filed under: liblda — ivan @ 4:38 am

This is one of the results that comes up when I type my name into google.

Very cool!
I mean this is what I want to be associated with ;)

Let’s have an update
We are waaaay past June, and no LDA code has been written. Some papers have been read. Some new results out on the list. I spoke with my supervisor and she said my project was a good idea. Let’s see if I can make it…

I kind of stopped thinking about a whole py-lda, because I learned about the MAHOUT project, which has a perfectly good LDA implementation that can even take advantage of clusters. I want to investigate that further and maybe write MAHOUT plugins instead of making my own complete ML library…

On the other hand playing with NumPy arrays from C and python will probably be a good exercise in efficiency (and pointer counting).

On the theory side, I have managed to word the problem scientifically, but I am not sure how it can be solved…. wait. I just realized I was complicating the problem unnecessarily. Here is the simpler version:

Let p(x) be a discrete probability distribution over x \in \{list of words\}, so that

  \sum_x p(x) = 1

Let q_1(x), q_2(x), q_3(x), \ldots, q_n(x) be a set of probability distributions and let

  q(x) = \lambda_1 q_1(x) + \lambda_2 q_2(x) + \cdots + \lambda_n q_n(x)

Can you find the optimal \vec{\lambda} which minimizes the Kullback-Leibler divergence between p(x) and q(x), i.e.

  \argmin_{\vec{\lambda}} KL( p(x), \vec{\lambda}\cdot\vec{q}(x) )
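One way to attack the problem numerically, as a rough sketch and not a claim about the best method: if we additionally assume that the weights are non-negative and sum to one (the post does not say this, but it is needed for q(x) to be a distribution), the objective is convex in \vec{\lambda}, and the standard EM update for mixture weights with fixed components decreases KL(p, q) at every step. The function names below are made up for illustration, and the toy distributions are invented so that the right answer is known in advance. The sketch also assumes q(x) > 0 wherever p(x) > 0.

```python
import math

def kl_divergence(p, q):
    """KL(p, q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(lam, qs):
    """Return the mixture q(x) = sum_i lam[i] * qs[i][x] as a list."""
    return [sum(l * q[x] for l, q in zip(lam, qs)) for x in range(len(qs[0]))]

def fit_mixture_weights(p, qs, iters=500):
    """EM update for the weights only; the components qs stay fixed.

    Assumes lam lives on the probability simplex (an assumption added here)
    and that the mixture is non-zero wherever p is non-zero.
    """
    n = len(qs)
    lam = [1.0 / n] * n  # start from uniform weights
    for _ in range(iters):
        q = mix(lam, qs)
        # lam_i <- lam_i * sum_x p(x) * q_i(x) / q(x)
        lam = [
            l * sum(p[x] * qi[x] / q[x] for x in range(len(p)) if p[x] > 0)
            for l, qi in zip(lam, qs)
        ]
    return lam

# Toy check: build p as a known mixture and try to recover the weights.
qs = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
p = mix([0.3, 0.7], qs)
lam = fit_mixture_weights(p, qs)
print(lam, kl_divergence(p, mix(lam, qs)))
```

In the toy case p lies exactly in the span of the q_i, so the divergence can be driven to zero. For a real word distribution it will not, and the recovered \vec{\lambda} is only the closest mixture in the KL sense.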


May 14, 2010

Summer has begun

I am not wearing shorts yet, but I have been spending a lot more time outside in the sun. Looking at the calendar I see that 15 days of summer have already expired without much productivity on my part. Not on projects, not on research and not even on the resting front.

Now is the time to sit down and make a little plan for the coming months to see how things will go down. The main plan is to work hard this summer. Yeah. Seriously work, work and work. Try to get to some high-energy level where I am waking up early and getting 4h of intellectual labour per day and 4h of other stuff.
Come September I get on a plane and ship myself to Bulgaria for a couple of months. Productivity in-a-de-homeland? Vacation too. Go to the cottage. Visit P. in Berlin. Visit D. in Austria/London. Why not an Amsterdam trip while we are at it.

But let's get more specific on the work side. What work absolutely HAS to be done by the end of this summer?

  1. PhD thesis topic settled
  2. Interference channel paper
  3. minireference content (en: and fr:) be written and organized
  4. minireference website
  5. liblda.py must be started

Apart from that there are these projects that I would like to work on,
but they are not mission critical.

  • The kronos project with A., more specifically the web-text editor, the scheduler and the pdflatex renderer…
  • Non-binning information theory for neuroscience research
  • Writing up papers
  • minireference iPhone/iPad app in Objective-C

So the TODO list has been set down. Now comes the harder part: actually doing everything on the list. I have to be frank with myself: none of this will get done without effort and without getting up early in the morning. I have to get myself to a higher energy level and then things will work out.

Vamolos !


January 6, 2010

wikipedia LDA

Filed under: Computers, liblda — ivan @ 7:01 pm

There is more work to do on the arXiv data set, but I can’t wait to also run the topic model on wikipedia… I mean it should be interesting.

I downloaded a 25G XML file from the wikipedia site. That is a FAIR bit of text don’t you think?
I might have to hook up more RAM than is currently available to me.


grep "<page>" enwiki-latest-pages-articles.xml | wc
9326872 9326872 83941848

This means there are roughly 9 million pages in wikipedia?
no…. actually lots of them are #REDIRECT pages.

grep "#REDIRECT" enwiki-latest-pages-articles.xml | wc
3672113 20616375 298192347

So there are roughly 5.6M non-redirect pages (9,326,872 - 3,672,113 = 5,654,759). Maybe some of them are categories and disambiguation pages…. here, they say there are 3M articles. What to do….

It is a SIGNIFICANT jump from 20k to 3M articles don’t you think?
150 times more: if it took me 600M of RAM before, it will take me approx. 90G of RAM now (assuming memory grows linearly with the number of articles)…. hm…
it looks like we will have to sub-sample my friend ;)
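A quick back-of-the-envelope check of that scaling, using only the numbers from this post (20k articles, 600M of RAM, 3M articles, a 16G machine). The big assumption, which I have not verified, is that memory grows linearly with the number of articles; the helper names are invented for this sketch and "16G" is taken as 16,000M.

```python
def estimate_ram_mb(base_docs, base_ram_mb, target_docs):
    """Scale a measured memory footprint linearly with document count."""
    return base_ram_mb * target_docs / base_docs

def max_docs_for_budget(base_docs, base_ram_mb, budget_mb):
    """How many documents fit in a RAM budget under the same linear assumption."""
    return int(budget_mb * base_docs / base_ram_mb)

scale = 3_000_000 / 20_000                                 # growth in article count
needed_mb = estimate_ram_mb(20_000, 600, 3_000_000)        # RAM for all of wikipedia
fit_docs = max_docs_for_budget(20_000, 600, 16_000)        # articles that fit in 16G
fraction = fit_docs / 3_000_000                            # sub-sampling rate

print(scale, needed_mb, fit_docs, round(fraction, 3))
```

Under these assumptions a 16G box holds a bit over half a million articles, so the sub-sample would be somewhere around one article in six.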

Maybe it is time I built a POWER SERVER with 16G of RAM in my own home?
Lack of technology should never slow me down right…. if this is my dream then I should act on it.