We got algorithms:

UPDATE 9:33PM: Algorithms are good and all, but now we need data.
And now we have data too. So lets get cracking my friend!
Description of the problem: You have one day to take this set of 2484 documents and produce a meaningful categorization of the topics in the documents. What are these documents about?
Input: You don’t actually have the full documents. You only have some statistics about which words occur in which documents. I.e. the documents are garbles — say out of privacy concerns — and instead of the full text of the documents you are only given the counts[w,d] of how many times the word w occurs in document d.
counts(:,1)
ans =
(16,1) 2
(40,1) 1
(46,1) 4
(68,1) 1
(76,1) 1
(95,1) 1
(121,1) 1
(199,1) 1
(290,1) 1
...
This means that in the document 1 (and this will be document 0 in numpy) the words word16 occurs twice, and w40
In other words, if you were to read the words of the document in alphabetical order you would see:
words( [16, 16, 40, 46, 46, 46, 46, 68, 76, 95 ]) #id2word
'ability' 'ability' 'abstract' 'abu' 'abu' 'abu' 'abu' 'access' 'accommodate' 'accumulated'
cool no?
I mean the data of “which words occur in which documents” is all there, but the document is not readable.
The important statistics about the word usage are there though. Ans since the problem was to produce a categorization of the documents, then we probably could still get the job done even without knowing the order the words appeared in.
There is in fact a very simple probabilistic model which requires precisely this kind of representation for the words: each document is a “bag of words”. I have learned recently, though, that one should not make things sound too simple, but use as precise a language as necessary. This is supposed to impress the reader. In this spirit then, because I want to impress you my dear readers — i mean what other reason would there be for me to be writing this post instead of working on the paper — for this reason I will tell you that the bag of words is the informal way of talking about the exchenagability of a set of random variables.
The NIPS data set has
size(docs_names)
1 2484
# documents
# over a dictionary of
size(words)
1 14036
# thus the matrix is
size(counts)
14036 by 2484.
Qualitative output: Say you find that there are 30 topics that are discussed in the documents, 30 categories if you prefer. Then for each topic you have to show the common words of that topic. The users are meant to look at the words, and be like “yeah i know which words these are, this is topic so and so”. I have that part down pat.
Quantitative output: Now why are your categories better than Blei, Jordan and Ng’s? Do you have something measurable you can use to prove your algorithms are better than the baseline?
Ok time to code.
UPDATE June 15th
We have the NIPS corpus under control and liblda just had its way with it.
Lo and behold the data from the first run on the data.
I wanted to see for how long I should run the algorithm before I can read out a “good value”.
An intuitive measure of this is the perplexity: how surprized am I to see the words that I see
in the documents. If you have a good model M then the uncertainty of the
data X given the model is H(X|M) should be a small quantity. In other
words, if the model “knows what it is talking about” then you should be able to predict which
words will be in the document just by looking at the mixture of topics.
My algorithm (actually Griffiths and Steyvers’) starts from a completely random assignment
of topics — if the model was some set of “category boxes” that the documents have to be
filed in, then the algorithm begins by throwing all the documents around in random boxes.
As you can see by the graph when we initialize the model the perplexity is pretty high.
If you were try to look at the topics that the model predicts you will see random words,
with the most likely words in the corpus appearing more often, but in all other senses
completely non-sensical.
This is where Gibbs random sampling comes into the picture: a very simple procedure which consists
of picking up one document out of a box looking at it and throwing it — again randomly —
into a new box. You could put it back in the same box, but if another box has higher likelihood
you are more likely to put it in that box.
(Actually we don’t throw entire documents but individual words, but the analogy would brake down.)
After a while, and in my case this was about 700 repetitions of the whole “de-boxing and re-boxing”
algorithm starts to predict the contents of the document pretty well:

Definition of perplexity text: Given just the model M, and the id of a document d_id,
predict the value of the word count vector counts(:,d_id), for that document.
How does LDA do this? As far as the model M is concerned, document d corresponds to a vector theta_d in mathbb(R)^T. Of if you are not the geometric kind you can think of the vector as a probability distribution p(t|d). LDA describes the document as a mixture of topics, and that mixture is p(t|d). Ok so what is a topic then and how the hell is this topic thing going to help me predict the word counts?
Glad you asked. In fact, the two quesitons that you asked are related. You see a topic t_id is another mixture — a mixture of words distributed according to p(w|t_id). LDA learns that too
In [12]: list(enumerate(theta[0,:]) #theta is what we call p(t|d) in liblda
[(0, 0.00014556040756914121),
(1, 0.026346433770014558),
(2, 0.00014556040756914121),
(3, 0.010334788937409025),
(4, 0.0074235807860262007),
(5, 0.030713245997088797),
(6, 0.010334788937409025),
(7, 0.0030567685589519655),
(8, 0.0016011644832605533),
(9, 0.00014556040756914121),
(10, 0.00014556040756914121),
(11, 0.0088791848617176122),
(12, 0.10931586608442503),
(13, 0.32620087336244541),
(14, 0.00014556040756914121),
(15, 0.00014556040756914121),
(16, 0.0074235807860262007),
(17, 0.00014556040756914121),
(18, 0.00014556040756914121),
(19, 0.0074235807860262007),
(20, 0.0030567685589519655),
(21, 0.00014556040756914121),
(22, 0.12969432314410481),
(23, 0.00014556040756914121),
(24, 0.00014556040756914121),
(25, 0.00014556040756914121),
(26, 0.0016011644832605533),
(27, 0.016157205240174673),
(28, 0.0045123726346433775),
(29, 0.17772925764192141),
(30, 0.00014556040756914121),
(31, 0.00014556040756914121),
(32, 0.0030567685589519655),
(33, 0.00014556040756914121),
(34, 0.00014556040756914121),
(35, 0.00014556040756914121),
(36, 0.0016011644832605533),
(37, 0.01470160116448326),
(38, 0.0088791848617176122),
(39, 0.00014556040756914121),
(40, 0.00014556040756914121),
(41, 0.0059679767103347891),
(42, 0.075836972343522574),
(43, 0.00014556040756914121),
(44, 0.0030567685589519655),
(45, 0.00014556040756914121),
(46, 0.0016011644832605533),
(47, 0.00014556040756914121),
(48, 0.00014556040756914121),
(49, 0.00014556040756914121)]
And then you have to compute the counts you have to take the above mixture vector p(t|d) and
multiply it as follows:

i.e for you have to take each row of the matrix p(w|t) shown below multiply them by the corresponding mixture weight
above and add them together.
T=0 cells, cell, model, cortex, visual, orientation, cortical, receptive, input, stimulus, field, activity, response, spatial, connections, figure
T=1 learning, rules, training, rule, neural, expert, experts, task, learn, network, set, networks, knowledge, tasks, learned, data
T=2 providing, sequences, david, correspondence, entire, address, chain, shows, stress, protein, exploring, verlag, region, mathematics, principles, occupy
T=3 network, neural, networks, input, output, layer, nodes, net, node, inputs, architecture, weights, training, outputs, number, information
classification, training, class, classifier, recognition, set, classifiers, pattern, performance, error, classes, test, patterns, character, feature, data
analog, circuit, chip, figure, current, voltage, output, vlsi, input, circuits, neuron, synapse, pulse, silicon, weight, digital
neural, require, moving, caught, brenner, part, code, institute, distinguished, made, multidimensional, finally, eds, fit, population, fold
tree, node, structure, representations, language, representation, trees, nodes, connectionist, set, symbol, sequence, grammar, strings, context, role
ideas, naftali, engineering, tishby, universal, code, responses, features, department, random, independence, potentials, show, contiguous, coding, synergy
compare, knowledge, position, defined, share, account, detailed, times, low, neural, problem, squares, computation, observers, analyze, things
code, efficiency, codes, jersey, van, coding, israel, apply, traces, estimates, vertical, formulation, hebrew, quantify, engineering, correctly
motion, direction, visual, eye, velocity, field, position, head, system, map, location, moving, spatial, model, target, attention
T=12 distribution, probability, gaussian, information, density, noise, estimate, prior, approximation, variance, bayesian, function, estimation, sample, distributions, entropy
T=13 function, functions, theorem, bound, threshold, number, networks, proof, case, bounds, neural, result, set, size, complexity, class
time, routing, performance, rate, game, call, traffic, strategy, load, decision, schedule, rl, play, control, approach, table
information, definition, hebrew, involved, weak, jersey, bit, code, israel, identified, ruyter, potentials, locations, found, universality, wide
data, error, training, set, prediction, model, test, networks, regression, validation, neural, models, performance, cross, method, parameters
algorithm, learning, gradient, function, convergence, algorithms, error, vector, descent, weight, time, rate, optimal, update, problem, parameter
image, images, object, objects, features, feature, recognition, figure, model, visual, pixel, vision, local, view, pixels, based
center, van, neural, usa, scene, university, version, world, statistical, computer, code, science, spikes, brenner, quantitative, engineering
learning, error, generalization, noise, training, weight, order, large, teacher, case, student, limit, fig, function, curves, line
brenner, naftali, rob, steveninck, huji, tishby, jerusalem, william, sequences, sensory, research, convey, clear, analysis, action, motor
T=22 results, number, figure, values, work, set, simple, large, order, case, similar, single, small, process, paper, found
time, state, system, recurrent, dynamics, neural, networks, states, systems, model, dynamical, fixed, sequence, network, delay, point
rejected, flight, introduction, repeated, abstract, finish, rob, attention, universality, demanding, required, measured, makes, regional, case, lett
control, model, robot, learning, motor, controller, trajectory, arm, system, forward, feedback, figure, inverse, position, hand, movement
model, data, models, mixture, parameters, likelihood, algorithm, em, variables, log, probability, markov, hidden, conditional, bayesian, posterior
universality, representing, encode, bin, number, priori, counting, israel, information, intensity, hebrew, probability, integral, rob, provided, scaling
neurobiology, neural, science, structure, computer, institute, problem, transient, larger, amount, explained, find, neurons, generating, answer, university
T=29 neurons, network, neuron, memory, patterns, pattern, associative, input, model, capacity, activity, synaptic, neural, function, networks, phase
model, cells, system, activity, cell, brain, neurons, figure, sensory, response, behavior, neural, motor, receptor, pattern, gain
ruyter, hand, code, sources, basic, binary, tishby, view, higher, identifiable, bayesian, kernel, computer, coding, trial, science
matrix, linear, component, analysis, components, independent, pca, principal, data, ica, information, source, projection, vector, separation, learning
kernel, support, vector, set, training, margin, function, problem, regression, svm, data, kernels, risk, machines, linear, functions
space, local, points, function, functions, basis, figure, data, dimensional, point, linear, regions, rbf, gaussian, region, global
similarity, presentations, windows, bins, methods, unlike, individuals, naftali, sensitive, standard, correlated, functional, jersey, make, find, differences
universality, sec, school, department, introduction, experiment, grant, drive, code, divergence, sequence, property, present, related, reflection, draw
learning, algorithm, examples, loss, probability, hypothesis, distribution, sample, set, algorithms, class, concept, query, space, learner, positive
search, level, algorithms, genetic, resolution, wavelet, time, structure, algorithm, hierarchical, hierarchy, population, function, levels, coarse, compression
signal, frequency, time, noise, filter, signals, auditory, temporal, filters, processing, system, information, model, phase, sound, figure
speech, recognition, word, system, training, hmm, context, words, speaker, time, phoneme, acoustic, state, frame, performance, sequence
performance, face, human, subjects, task, detection, data, target, set, eeg, faces, analysis, false, based, rate, subject
T=42 learning, state, reinforcement, action, policy, time, optimal, actions, states, function, sutton, reward, environment, algorithm, control, step
distance, vectors, vector, data, clustering, cluster, space, algorithm, feature, tangent, clusters, map, set, transformation, learning, kohonen
units, network, hidden, unit, learning, input, weights, output, training, layer, networks, weight, propagation, net, patterns, error
firing, spike, neuron, neurons, synaptic, time, input, model, rate, membrane, cell, synapses, potential, spikes, current, information
universality, neural, bialek, school, princeton, independence, fly, flies, theory, department, talk, shown, tishby, cases, emerge, systems
energy, problem, field, solution, optimization, graph, function, constraint, networks, solutions, constraints, boltzmann, neural, method, annealing, local
system, parallel, bit, time, neural, data, block, memory, speed, processor, vector, systems, implementation, weight, processing, performance
T=49 visual, sufficiently, directional, related, ens, smaller, variability, times, hope, based, rob, usa, find, avg, http, computation
Ok I know what you are thinking. I was reading this post and having a good time, and then whaaam the author starts throwing these massive lists of words and number and what not. Yes, but this is how it goes in ML. It is all about the data. So I show you the data. If you prefer some high level journalistic sugar, please look elsewhere. Here we don’t simplify.
No no, we will simplify, I was just kidding. The mixture of topics in document 0 consists mostly of zeros, except for the following
In [12]: list(enumerate(theta[0,:] # p(t|d)
(13, 0.32620087336244541),
(29, 0.17772925764192141),
(22, 0.12969432314410481),
(12, 0.10931586608442503),
(42, 0.075836972343522574
So the model thinks that if you mix the words below
T=13 function, functions, theorem, bound, threshold, number, networks, proof,
T=29 neurons, network, neuron, memory, patterns, pattern, associative, input,
T=22 results, number, figure, values, work, set, simple, large, order, case, similar
T=12 distribution, probability, gaussian, information, density, noise, estimate
T=42 learning, state, reinforcement, action, policy, time, optimal, actions, states
You will get the words in document 0.
UPDATE Jun 23: Due to too much blogging and not enough result producing, I didn’t manage to meet the deadline for the conference. Raazclaart, as they say in Jamaica. Rather than take this hit and crumble I am just going to take this hit and stand. Ain’t no stopping me, me tellz you.