we attacked the problem of solving crossword puzzles by computer: given a set of clues and a crossword grid, try to maximize the number of words correctly filled in. after an analysis of a large collection of puzzles, we decided to use an open architecture in which independent programs specialize in solving specific types of clues, drawing on ideas from information retrieval, database search, and machine learning. each expert module generates a (possibly empty) candidate list for each clue, and the lists are merged together and placed into the grid by a centralized solver. we used a probabilistic representation throughout the system as a common interchange language between subsystems and to drive the search for an optimal solution. letters correct in under 15 minutes per puzzle on a sample of 370 puzzles taken from the new york times and several other puzzle sources. this corresponds to missing roughly 3 words or 4 letters on a cruciverbalist (crossword solver). words of truth... crossword puzzles are attempted daily by millions of people, and require of the solver both an extensive knowledge of language, history and popular culture, and a search over possible answers to find a set that fits in the grid. this dual task, of answering natural language questions requiring shallow, broad knowledge, and of searching for an optimal set of answers for the grid, makes these puzzles an interesting challenge for artificial intelligence. in this paper, we solver for british-style crosswords published by genius~2000 software. it is intended as a solving aid, and while it appears quite good at thesaurus-type clues, in informal tests it did poorly at human champions, it exceeds that of casual human solvers, averaging we will first describe the problem and provide some of the insights we gained from studying a large database of crossword puzzles, which motivated our design choices. 
we will then discuss our underlying answers to clues are suggested by expert modules, and how we search for an optimal fit of these possible answers into the grid. finally, we will present the system's performance on a large test suite of daily crossword puzzles, as well as on 1998 tournament puzzles. the solution to a crossword puzzle is a set of interlocking words is presented with an empty grid and a set of clues; each clue suggests its corresponding target. some clue-target pairs are relatively fixed-width font; all examples are taken from our clue database. we will note the target length following sample clues in this paper to oblique and based on word play: most a dozen or so words long, averaging about 2.5 words in a sample of clues we've collected. to solve a crossword puzzle by computer, we assume that we have both the grid and the clues in machine readable form, ignoring the special formatting and unusual marks that sometimes appear in crosswords. the of letters, given the numbered clues and a labeled grid. in this work, we focus on american-style crosswords, as opposed to british-style or cryptic crosswords. by convention, all targets are at least 3 letters in length and long targets can be constructed by part of a down target and an across target. as this is largely a new problem domain, distinct from crossword solving really was. to gain some insight into the problem, we studied a large corpus of existing puzzles. we collected 5133 crossword puzzles from a variety of sources, summarized in newspaper puzzles (the new york times, the los angeles times, the usa today, tv guide), from online sites featuring puzzles (dell, riddler) or from syndicates specifically producing for the online medium (creator's syndicate, crossynergy syndicate). these puzzles constitute a crossword database (cwdb) of around 350,000 clue-target pairs, with over 250,000 of them unique, which served as a potent knowledge source for this project. 
human solvers improve with experience, in part because particular clues and targets tend to recur. for example, many human solvers will ``otherwise,'' and ``region''. the five most common clue-target pairs our cwdb corresponds to the number of puzzles that would be encountered by a human over a fourteen-year period, at a rate of one puzzle a day. what percentage of targets and clues in a new puzzle presented to our system will be in the existing database---how novel are crossword novel targets, clues, clue-target pairs, and clue words as we increase the number of elements in the database. after randomizing, we looked at subsets of the database ranging from 5,000 clues to almost 350,000. for each subset, we calculated the percentage of items of each type (target, clue, clue-target pair, clue word) that are unique. this is an estimate of the likelihood of the next item being novel. given the complete database (344,921 clues) contains a tremendous amount of useful domain-specific information. cwdb. given all 350,000 clues, we would expect a new puzzle the new york times (nyt) crossword is considered by many to be the premier daily puzzle. nyt editors attempt to make the puzzles increase in difficulty from easy on monday to very difficult on saturday and sunday. we hoped that studying the monday-to-saturday trends in the puzzles might provide insight into what makes a puzzle hard for humans. types change day by day. for example, note that some ``easier'' less and less common as the week goes on. in addition, clues with a which is often a sign of a themed or pun clue, get more common. the distribution of target lengths also varies, with words in the 6 to 10 letter range becoming much more common from monday to saturday. sunday is not included in the table as it is a bit of an outlier on some of these scales, partly due to the fact that the puzzles are such as fill-in-the-blank and quoted phrases, clue structure leads to simple ways to answer those clues.
for example, given the clue sources looking for all 9-letter phrases that match on word boundaries and known letters. if encounter a clue such as abbreviations. there are also clues that do not fit a simple pattern, but might be that can be brought to bear to solve different types of clues, this suggests a two-stage architecture for our solver: one consisting of a collection of special-purpose and general candidate-generation modules, and one that combines the results from these modules to generate a solution to the puzzle. this decentralized architecture allowed a relatively large group of contributors (approximately ten people) to contribute modules using techniques ranging from generic word lists to highly specific modules, from string matching to general-purpose information retrieval. the next section describes candidate lists, in isolation from the other grid constraints. expert modules are free to return no candidates for any clues, or 10,000 for every one. the collection of candidate lists is then reweighted by and combined into a single list of candidates for each clue. finally, solution it can find that also satisfies the grid constraints. are described in a later section. to unify the candidate-generation modules, it is important to first understand our underlying assumptions about the crossword-puzzle problem. first, assume that crossword puzzles are created by repeatedly choosing words for the slots according to a particular creator's distribution (ignore clues and crossing constraints for now). after choosing the words, if the crossing constraints are satisfied, then the creator keeps the puzzle. otherwise, the creator draws again. normalizing to account for all the illegal puzzles generated gives us a probability distribution over legal puzzles. now, suppose that for each slot in a puzzle, we had a probability distribution over possible words for the slot given the clue.
then, we could try to solve one of a number of probabilistic optimization problems to produce the ``best'' fill of the grid. in our work, we define ``best'' as the puzzle with the maximum expected number of targets in common with the creator's solution: the maximum expected overlap. we will discuss this optimization more in a following section, but for now it is important only to see that we would like to think of candidate generation as establishing probability distributions over possible solutions. we will next discuss how individual modules can create approximations to these distributions, how we can combine them into a unified distribution, and then finally how we can search to find a good solution to the optimization problem. the first step is to have each module generate candidates for each clue, given the target length. each module returns a confidence score (how sure it is that the answer lies in its list), and a weighted list the module returns a 1.0 confidence in its list, and gives higher weight to the person on the show with the given last name, while giving lower weight to other cast members. note that most of the modules will not be able to generate actual probability distributions for the targets, and will need to make approximations. the merging step discussed next will attempt to account for the error in these estimates by testing on training data, and adjusting scaling parameters to compensate. it is important for modules to be consistent, and to give more likely candidates more weight. also, the better control a module exerts over the overall confidence score when uncertain, the more the merger will ``trust'' the module's predictions. in all, we built 30 different modules, many of which are described briefly below. to get some sense of the contribution of the major containing 5374 clues. these puzzles were drawn from the same sources as the test puzzles, ten from each.
for each module, we list several measures of performance: the percentage of clues that the module guessed at, the percentage of the time the target was in the module's candidate list, the average length of the returned lists, and the percentage of clues the module ``won''---it had the correct answer weighted higher than all other modules. this final statistic is an important measure of the module's contribution to the system. for example, the wordlist-big module generates over 100,000 words for some time). however, since it generates so many, the individual weight given to the target is usually higher than that assigned by other another way of looking at the contribution of the modules is to consider the probability assigned to each target given the clues. ideally, we would like all targets to have probability 1. in general, we want to maximize the product of the probabilities assigned to the targets, since this quantity is directly related to what the represents the probability assigned by the bigram module (described later). this probability is low for all targets, but very low for the hard targets. as we add groups of modules, the effect on the probabilities assigned to targets can be seen as a lowering of the curve, which corresponds to assigning more and more probability to the target. note the large increase due to the exact match module. finally, notice that there is a small segment that we do very poorly on---the targets that no module other than bigram returns. we will later introduce extensions to the system that help with this range. added shows that different types of modules make different return all words of the correct length from several dictionaries. wordlist contains a list of 655,000 terms from a wide variety of sources, including online texts, encyclopedias and dictionaries. wordlist-big contains everything in wordlist, as well as many constructed `terms', produced by combining related entries in databases.
this includes combining first and last names, as well as merging adjacent words from clues in the cwdb. wordlist-big contains over 2.1 million terms. wordlist-cwdb contains the 58,000 unique targets in the cwdb, and returns all targets of the appropriate length, regardless of the clue. it weights them with estimates of their ``prior'' probabilities as targets of arbitrary clues. length associated with this clue in the cwdb. confidence is based on a bayesian calculation involving the number of exact matches of correct and incorrect lengths. transformations which, when applied to clue-target pairs in the cwdb, generate other clue-target pairs in the database. when faced with a new clue, it applies all applicable transformations and returns the results, weighted based on the previous precision/recall of these transformations. transformations in the database include single-word substitution, removing one phrase from the beginning or end of a clue and adding another phrase to the beginning or end of the clue, depluralizing a word in the clue and pluralizing the associated target, and others. the following is a list of several non-trivial examples from the tens of thousands of transformations learned: crossword clues present an interesting challenge to traditional information retrieval (ir) techniques. while queries of similar length to clues have been studied, the ``documents'' to be returned are quite different (words or short sequences of words). in addition, the queries themselves are often purposely phrased to be ambiguous, and never share words with the ``documents'' to be returned. despite these differences, it seemed natural to try a variety of existing ir techniques over several document collections. this module is based on an indexed set of encyclopedia articles. for each query term, we compute a distribution of terms ``close'' to the query term in the text.
a term is counted $10-k$ times in this distribution for every time it appears at a distance of $k<10$ words away from the query term. a term is also counted once if it appears in an article for which the query term is in the title, or vice versa. terms of the correct target length are assigned scores proportional to their frequencies in the ``close'' distribution, divided by their frequency in the corpus. the distribution of scores is normalized to 1. if a query contains multiple terms, the score distributions are combined linearly according to the log inverse frequency of the query terms in the corpus. if the query contains very common terms such as ``as'' and ``and,'' they are ignored. a vector space with one dimension for every word in the dictionary. a clue is represented as a vector in this space. for each word $w$ a clue contains, it gets a component in dimension $w$ of magnitude for a clue $c$, we find all clues in the cwdb that share words with $c$. for each such clue, we give its target a weight based on the dot product of the clue with $c$. the assigned weight is geometrically this dot product. vector space model that uses singular value decomposition to identify correlations between words. lsi has been successfully applied to the which is closely related to solving crossword clues. our lsi modules were trained on cwdb (all clues with the same target were treated as a document) and separately on an online encyclopedia and returned the closest words (by cosine) with each clue. the dijkstra modules were inspired by the intuition that related words either co-occur with one another or co-occur with similar words. this suggests a measure of relatedness based on graph distance. from a selected set of text databases, the module builds a weighted directed graph on the set of all terms. 
for each database $d$ and each pair of terms $(t,u)$ that co-occur in the same document, we place an edge from $t$ to $u$ in the graph with weight, for a one-word clue $t$, we assign a term $u$ a score of we find the highest scoring terms with a shortest-path-like search. for a multi-word clue, we break the clue into individual terms and add the scores as computed above. the four dijkstra modules in our system use variants of this technique. for databases, we used an encyclopedia index, two thesauri, a database of wordforms and the cwdb. shows. this module looks for a number of patterns in the clue (e.g. quoted titles as in copy of the database in a variety of forms. these modules use simple pattern matching of the clue (looking for keywords ``city'', ``author'', ``band'' and others as in database. the literary database is culled from both online and encyclopedia resources. the geography database is from the getty information institute, with additional data supplied from online lists. there are four distinct synonym modules, based on three different looks for root forms of words in the clue, and then finds a variety of of relevance feedback is used to generate lists of synonyms of synonyms. finally, if necessary, the forms of the related words are converted back to the form of the original clue word (number, tense, over five percent of all clues in cwdb have a blank in them. we searched a variety of databases to find clue patterns with a missing word (music, geography, movies, literary and quotes). for example, any four characters to fill the blanks, including multiple words. in some of our pretests we also ran these searches over more general sources of text like encyclopedias and archived news feeds, but for efficiency, we left these out of the final runs. ``kind of'' clues are similar to fill-in-the-blank clues in that they involve pattern matching over short phrases.
we identified over 50 cues that indicate a clue of this type, for example, and after each expert module has generated a weighted candidate list, we must somehow merge these into a unified candidate list with a common weighting scheme for the solver. this problem is similar to the problem facing meta-crawler search engines in that separately weighted return lists must be combined in a sensible way. an advantage of this domain is ready access to precise and abundant training data. for a given clue, each expert module $m$ returns a weighted set of candidates and a numerical level of confidence that the correct target is in this set. for each expert module $m$, we set three real clue, we reweight the candidate set by raising each weight to the probability distribution over candidates, we linearly combine the modified candidate sets of all the modules weighted by their modified confidence levels, and normalize the sum to 1. control over how the information returned by an expert module is incorporated into the final candidate list. we set these parameters using a naive hill-climbing technique. the objective function for optimization is the average log probability assigned to the correct target. this corresponds to maximizing the average log probability assigned by the solver to the correct puzzle fill-in, since in our model the probability of a puzzle solution is proportional to the product of the prior probabilities on the answers in each of the slots. the optimal value we achieve on the 70 puzzle after realizing how much repetition occurs in crosswords, in both targets and clues, and therefore how well the cwdb covers the domain, one might wonder whether this coverage is enough to constrain the domain to such an extent that there is not much for the grid-filling algorithm to do. we did not find this to be the case. simplistic grid filling yielded only mediocre results. 
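the merging step described earlier in this section can be illustrated with a short python sketch. it is only a sketch under stated assumptions, not the system's actual merger: the per-module parameters are reduced here to two hypothetical knobs, `spread` (the exponent applied to each candidate weight) and `scale` (a multiplier on the module's confidence), and the names `merge_candidates`, `module_outputs` and `params` are invented for illustration.

```python
from collections import defaultdict


def merge_candidates(module_outputs, params):
    """Merge per-module candidate lists for one clue into one distribution.

    module_outputs: {module_name: (confidence, {candidate: weight})}
    params:         {module_name: (scale, spread)}  # hypothetical knobs
    """
    merged = defaultdict(float)
    for name, (confidence, candidates) in module_outputs.items():
        scale, spread = params[name]
        # Reweight: raise each weight to a power, then normalize so the
        # module's list is a probability distribution over its candidates.
        powered = {c: w ** spread for c, w in candidates.items() if w > 0}
        total = sum(powered.values())
        if total == 0:
            continue  # the module returned nothing usable for this clue
        # Linear combination, weighted by the adjusted confidence level.
        for cand, w in powered.items():
            merged[cand] += scale * confidence * (w / total)
    # Normalize the combined list so that it sums to 1.
    z = sum(merged.values())
    return {c: w / z for c, w in merged.items()} if z > 0 else {}


outputs = {
    "module_a": (1.0, {"cat": 3.0, "cot": 1.0}),
    "module_b": (0.5, {"cat": 1.0}),
}
knobs = {"module_a": (1.0, 1.0), "module_b": (1.0, 1.0)}
merged = merge_candidates(outputs, knobs)
```

a hill-climbing tuner of the kind mentioned above would repeatedly perturb the entries of `params` and keep a change whenever the average log probability assigned to the correct targets on the training clues improves.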
as a measure of the task left to the grid-filling algorithm, on the first iteration of solving, of targets are in the top of the candidate list for their slot. however, the grid-filling algorithm is able to raise this to filling as an optimization problem. in particular, the across and down letter intersections establish constraints on how the grid can be filled, and crossword-puzzle filling is often cited as a constraint satisfaction problem. however, in our case, we don't just want to find ``best'' fit. we can define ``best'' in several different ways, but in these tests we attempted to maximize the expected overlap with the creator's solution, in terms of words correct. other definitions of ``best'' include maximizing the probability of getting the entire puzzle correct, or maximizing expected letter overlap. the decision to use expected word overlap is motivated by the scoring system used in human tournaments (see below). since finding the optimal solution to this problem is intractable, we employ a variety of efficient approximations. our probability measure assigns probability zero to a target that is suggested by no module and probability zero to all solutions containing that target. therefore, we need to assign non-zero probability to all letter sequences. clearly, there are too many to actually list explicitly. we augmented the solver to reason with probability distributions over candidate lists that are implicitly additional candidates once the solver can give them more information about letter probability distributions over the slot. the most important of these is a letter bigram module, which ``generates'' all possible letter sequences of the given length by returning a letter bigram distribution over all possible strings, learned from the cwdb. because the bigram probabilities are used throughout the solution process, this module is actually tightly integrated into the solver itself. no module except bigram is returning the target. 
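as an illustration of how a letter bigram model assigns a non-zero probability to every letter sequence of a given length, here is a minimal python sketch. the add-one smoothing, the start marker `^`, and the tiny training list are assumptions made for the example; the text only states that the bigram distribution is learned from the cwdb.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
START = "^"  # hypothetical start-of-word marker


def train_bigrams(words):
    """Return prob(prev, ch): a smoothed estimate of P(ch | prev)."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        prev = START
        for ch in word:
            pair_counts[(prev, ch)] += 1
            context_counts[prev] += 1
            prev = ch

    def prob(prev, ch):
        # Add-one smoothing: unseen letter pairs still get some mass.
        return (pair_counts[(prev, ch)] + 1) / (context_counts[prev] + len(ALPHABET))

    return prob


def log_prob(prob, s):
    """Log probability of string s among all strings of length len(s)."""
    total = 0.0
    prev = START
    for ch in s:
        total += math.log(prob(prev, ch))
        prev = ch
    return total


# Toy training data standing in for the targets of a clue database.
prob = train_bigrams(["area", "aria", "erie", "oreo"])
likely = log_prob(prob, "area")
unlikely = log_prob(prob, "zqxj")
```

because each conditional distribution sums to one, the model defines a proper distribution over all strings of any fixed length, so a target containing many rare letter pairs receives a small but non-zero probability.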
in a pretest run on 70 puzzles, the clue-target with the lowest probability was times, and it gets a particularly low probability because of the many unlikely letter pairs in the target. once the grid-filling process is underway, we have probability distributions for each letter in these longer targets and this can limit our search for candidates. to address longer, multiword targets, we created free-standing implicit distribution modules. each implicit distribution module takes a letter probability distribution for each letter of the slot (computed within the solver), and returns weighted candidate lists. these lists are then added to the previous candidate lists, and the grid-filling algorithm continues. this process of getting new candidates can happen several times during the solution process. the tetragram module suggests candidates based on a letter tetragram model, built from wordlist-big. we hoped this would provide a better model for word boundaries than the bigram model mentioned above, since this list contains many multiword terms. the segmenter calculates the $n$ most probable word sequences with respect to both the letter probabilities and word probabilities from several sources ($n = 10$ currently) using dynamic programming. the base word probabilities are unigram word probabilities from the cwdb. in addition, the dijkstra module (described above) suggests the best 1000 words (with weights) given the current clue. these weights and the unigram probabilities are then combined for a new distribution of word probabilities. combined word distribution, the segmenter returned the following top examining the clue raises the probabilities of related words like daily puzzles, and on the most recent human tournament puzzles. we tested the system on puzzles from seven daily sources, listed in the other sources were all from between august and december of 1998. we selected 70 puzzles, 10 from each source, as training puzzles for the system.
the reweighting process described above was trained on the 5374 clues from these 70 puzzles. additional debugging and modification of the modules was done after evaluation on these training puzzles. having fixed the modules and reweighting parameters, we then ran the system on the 370 puzzles in the final pool. addition, we split the nyt puzzles into two groups: monday through wednesday (mtw), and thursday through sunday (tfss). as noted earlier, there is an effort made at the nyt to make puzzles increasingly difficult as the week progresses, and with respect to earlier tests, there appeared to be a finer day-by-day trend from monday to saturday, but there is not enough data in this set (10 per to better gauge the system's performance against humans, we tested for 20 years, and was attended in 1998 by 252 people. the scoring system for the acpt requires that a time limit be set for each puzzle. a solver's score is then 10 times the number of words correct, plus a bonus of 150 if the puzzle is completely correct. in addition, the number of incorrect letters is subtracted from the full minutes early the solver finishes. if this number is positive, it is multiplied by 25 and added to the score. there were seven puzzles in the official contest, with time limits described in the previous section. the results over the 1998 puzzles competition finished all puzzles correctly, and the winner was determined by finishing time (the champion averaged under seven minutes per puzzle). thus, while not competitive with the its score on puzzle~5 exceeded that of the median human solver at the contest. the acpt puzzles are very challenging, and include tricks like multiple letters or words written in a single grid cell, and targets not produce answers that bend the rules in this way, it still word score on these puzzles, but brought down the tournament score because it works more slowly.
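the acpt scoring rule described above can be written out as a short python function. this is a sketch for illustration only; the function and parameter names are invented here, and the sample inputs are made-up values rather than results from the tournament.

```python
def acpt_score(words_correct, all_correct, minutes_early, incorrect_letters):
    """Score one puzzle under the ACPT rule as stated in the text."""
    # 10 points per correct word.
    score = 10 * words_correct
    # Bonus of 150 for a completely correct puzzle.
    if all_correct:
        score += 150
    # Incorrect letters are subtracted from the full minutes finished early;
    # only a positive remainder earns the 25-point-per-unit time bonus.
    time_bonus = minutes_early - incorrect_letters
    if time_bonus > 0:
        score += 25 * time_bonus
    return score


# Made-up inputs: a perfect solve finished 10 full minutes early,
# and an imperfect solve whose errors cancel its time bonus.
perfect = acpt_score(78, True, 10, 0)      # 780 + 150 + 250
imperfect = acpt_score(75, False, 5, 6)    # 750, no bonuses
```

the rule makes clear why a slow but accurate solver loses tournament points: the word score is unaffected by time, while the time bonus shrinks with every minute used and every wrong letter.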
the theoretical score if the solver did every puzzle in under a
solving crossword puzzles presents a unique artificial intelligence challenge, demanding from a competitive system broad world knowledge, powerful constraint satisfaction, and speed. because of the widespread appeal, system designers have a large number of existing puzzles to use to test and tune their systems, and humans with whom to compare. a successful crossword solver requires many artificial intelligence techniques; in our work, we used ideas from state-space search, probabilistic optimization, constraint satisfaction, information retrieval, machine learning and natural language processing. we found probability theory a potent practical tool for organizing the system and improving performance. the level of success we achieved would probably not have been possible five years ago, as we depended on extremely fast computers with vast memory and disk storage, and used tremendous amounts of data in machine readable form. perhaps the time is ripe to use these resources to attack other problems previously deemed too challenging for ai.
we received help and guidance from other members of the duke community: michael fulkerson, mark peot, robert duvall, fred horch, siddhartha chatterjee, geoff cohen, steve ruby, nabil h. mustafa, alan biermann, donald loveland, gert webelhuth, robert vila, sam dwarakanath, will portnoy, michail lagoudakis, steve majercik, syam gadde. via e-mail, will shortz and william tunstall-pedoe made considerable contributions.
genius 2000 software. http://www.genius2000.com/cm.html
will shortz. introduction to the new york times daily crossword puzzles, volume 47. random house, 1997.
while we know of no other research investigating solving crosswords by computer, other researchers have addressed related issues. several which the computer is given a grid and a word list, and asked to find a set of words that fills the grid, while satisfying the constraints between across and down words. there is one existing commercial crossword-puzzle solver for british-style crosswords, crossword maestro, published by genius 2000 software. this program performs quite well at answering individual thesaurus-type clues. however, it is intended as an aid to solving and therefore the grid-filling routines are not very sophisticated. in addition, the british-style clues tend to have more information in them, and more often follow one of the few dozen standard templates. while an unfair test, we did try crossword maestro on some of our american-style crosswords, and it did poorly at filling in the grid
choosing particular candidates for particular slots in the puzzle will
this module ``generates'' all possible letter sequences of the given length by returning a letter bigram distribution. the weight of any sequence of letters is actually the probability of that sequence of letter bigrams occurring in the model. the bigrams are learned from the cwdb. (or an implicit representation thereof)
we needed a way to combine the information returned by the expert modules into a probability distribution for the solver. we had to take into account that the modules vary widely in recall and accuracy.
also, while the weights and confidence estimates returned by the modules contain valuable information, they use wildly different heuristic scales to convey that information.
the two-stage model we have presented, first generating candidate lists and then merging and grid filling, is certainly not the only approach. for example, it's possible that modules could do a better job at suggesting targets with some limiting constraints on the slot, found during attempts at filling.
the candidates (including those of incorrect length) are initially weighted by estimates of their intended frequency as answers to the given clue. the resulting distribution is normalized to one, divided by the fraction of targets (nondistinct) in the cwdb which have the intended length, and then all the candidates of incorrect length are filtered out. this gives the appropriate boost in probability when the only candidate found happens to have the correct length.
puzzles from seven different sources, with and without the