Pereira, Tishby and Lee

Talks specifically about clustering to relieve data sparseness. I had a bit of trouble seeing whether the measures they used showed an advantage for clustering.

Baker and McCallum

Provides a method for using known class labels to identify sets of words that occur with similar distributions across documents. The resulting word clusters/categories are used for text classification with Naive Bayes.

Can we get the benefits of clustering (shared statistics) without hard clusters? Yes, I think so; that's what we're trying to do for Markov models. We estimate the statistics over a set of items that are sufficiently similar. This is done with kernels rather than a hard partition. It remains to be seen whether this is more accurate and/or whether it can be implemented accurately.

Slonim and Tishby

Aren't words already "clusters"? How can we bootstrap a process like this?

Have people looked at MDL clustering? Could that be a way to choose the number of clusters?

All the papers from today used real evaluations, which is nice.
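
Returning to the question of sharing statistics without hard clusters, here is a rough sketch of what kernel-based sharing might look like for Markov transition estimates. It is only an illustration under assumptions not stated in these notes: each context is assumed to have a numeric feature vector (`features`, hypothetical), similarity is assumed to be a Gaussian kernel with a hand-picked `bandwidth`, and the function names are made up for this example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Similarity in (0, 1]: 1 for identical vectors, decaying with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_smoothed_transitions(counts, features, bandwidth=1.0):
    """Estimate next-symbol distributions by soft-pooling counts.

    counts[i][j]  -- times symbol j followed context i
    features[i]   -- hypothetical feature vector describing context i
    Every context borrows counts from every other context, weighted by
    kernel similarity, so no hard partition into clusters is needed.
    """
    n = len(counts)
    smoothed = []
    for i in range(n):
        weights = [gaussian_kernel(features[i], features[k], bandwidth)
                   for k in range(n)]
        pooled = [sum(weights[k] * counts[k][j] for k in range(n))
                  for j in range(len(counts[i]))]
        total = sum(pooled)
        # If nothing was observed anywhere nearby, leave the row as zeros.
        smoothed.append([p / total for p in pooled] if total > 0 else pooled)
    return smoothed


# Context 1 has no observations of its own but sits close to context 0.
counts = [[8, 2], [0, 0], [1, 9]]
features = [[0.0], [0.1], [5.0]]
probs = kernel_smoothed_transitions(counts, features, bandwidth=0.5)
```

In this toy example the unobserved context ends up with roughly the distribution of its near neighbour (context 0), while the distant context 2 contributes almost nothing. The bandwidth plays the role that the number of clusters plays in a hard partition: a larger value shares statistics more widely.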