ericwallace_ucb 2 years ago

Hi, I am the first author of this paper and I am happy to answer any questions. You can find the technical paper here: https://arxiv.org/abs/2205.09665

  • mikeryan 2 years ago

    Hey this is cool, I do the NYT Crossword every day. A few questions.

    1. You mention an 82% solve rate. The NYT puzzle gets "harder" each day, Monday through Saturday. Do you track the days separately? If so, I'd be curious how much of the 18% unsolved ends up on Fridays and Saturdays. (For anyone who doesn't know, the Sunday puzzle is outside the M-Sat range since it's a bigger puzzle.)

    2. Related to the above, Thursday puzzles usually have "tricks" in them (skipped letters and whatnot) or require a rebus (multiple letters in one square) - do you handle these at all?

    3. Is this building an ongoing model and getting better at solving? Or did you have to seed it with a set of solved puzzles and clues?

    Sorry, I didn't have time to read the whole paper.

    • nickatomlin 2 years ago

      Hi! I'm another author on this paper. To answer your questions:

      1. Monday puzzles are the easiest for our model, and Thursdays are the most difficult. You can see a graph of day-by-day performance here: https://twitter.com/albertxu__/status/1527704535912787968

      2. Our current system doesn't have any handling for rebuses or similar tricks, although Dr. Fill does. I think this is part of why Thursday is the hardest day for us, even though Saturday is usually considered the most difficult.

      3. We trained it with 6.4M clues. As new crosswords get published, we could theoretically retrain our model with more data, but we aren't currently planning to do that.

      • sp332 2 years ago

        I don't suppose you gave more weight to more recent puzzles? Is there a time period or puzzle setter that was harder to solve because they favored an unusual clue type?

        • nickatomlin 2 years ago

          We didn't give more weight to recent puzzles. In fact, we trained on pre-2020 data, validated on data from 2020, and evaluated on post-2020 data.

          Our model seems to perform well despite this "time generalization" split, but there are a couple instances where it struggled with new words. For example, we got the answer "FAUCI" wrong in a puzzle from May 2021. Even though Fauci was in the news before 2020, I guess he wasn't famous enough to show up in crosswords, and therefore his name wasn't in our training data.

          I think evaluating performance by constructor would be really interesting! But we haven't done that.
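
          In case it's useful, the split itself is conceptually trivial - something like the toy sketch below, with made-up field names (this is not our actual pipeline code):

              from datetime import date

              def split_by_year(clues):
                  # Time-generalization split: train on pre-2020 clues,
                  # validate on 2020, test on post-2020.
                  train, val, test = [], [], []
                  for clue, answer, pub_date in clues:
                      if pub_date.year < 2020:
                          train.append((clue, answer))
                      elif pub_date.year == 2020:
                          val.append((clue, answer))
                      else:
                          test.append((clue, answer))
                  return train, val, test

              clues = [
                  ("Fuzzy fruit", "KIWI", date(2018, 3, 5)),
                  ("NIAID director Anthony", "FAUCI", date(2021, 5, 16)),
              ]
              train, val, test = split_by_year(clues)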

  • Imnimo 2 years ago

    For handling cross-reference clues, do you think it would be feasible in the future to feed the QA model a representation of the partially-filled puzzle (perhaps only in the refinement step - hard to do for the first step before you have any answers!), in order to give it a shot at answering clues that require looking at other answers?

    It feels like the challenges might be that most clues are not cross-referential, and even for those that are, most information in the puzzle is irrelevant - you only care about one answer among many, so it could be difficult to learn to find the information you need.

    But maybe this sort of thing would also be helpful for theme puzzles, where answers might be united by the theme even if their clues are not directly cross-referential, and could give enough signal to teach the model to look at the puzzle context?
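
    Concretely, I'm picturing something like serializing the current grid state into the QA model's input - a hypothetical sketch (the format and names here are made up, not from the paper):

        def encode_with_context(clue, pattern, crossings):
            # pattern: known letters of the slot, "_" for unfilled squares.
            # crossings: {slot_id: current answer or None} for intersecting slots.
            ctx = "; ".join(f"{slot}={ans or '?'}" for slot, ans in crossings.items())
            return f"clue: {clue} | pattern: {''.join(pattern)} | crossings: {ctx}"

        print(encode_with_context(
            "See 17-Across",
            ["_", "_", "R", "_"],
            {"17A": "NOTE", "23D": None},
        ))
        # -> clue: See 17-Across | pattern: __R_ | crossings: 17A=NOTE; 23D=?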

  • twright0 2 years ago

    This is super interesting work!

    One thing I was curious about - the ACPT is a crossword speed-solving competition, with time spent solving a major aspect of total score. How did you approach leveling the playing field between the human and computer competitors?

  • mikeryan 2 years ago

    Oh also do you use the puzzle Author as a data point? I wonder if there are patterns to be gleaned there.

  • avrionov 2 years ago

    Do you think your approach can be applied to other problems?

thom 2 years ago

Note for Brits: these aren't cryptic (dare I say 'real') crosswords, but I assume it could be retooled for that.

  • tialaramex 2 years ago

    American crosswords are different in two key ways, as I understand it:

    Firstly, all "serious" British crosswords are "cryptic", i.e. once you figure out what the answer is, it's apparent why that's the correct answer, but figuring out the answer from the clue involves lateral thinking and some skills learned from years of staring at such clues.

    e.g. Private Eye's crossword 726 (back in April), clue 23 down,

    "He finally gets to penetrate agreeable person (relatively) (5)"

    The correct answer is "Niece". "Nice" can mean agreeable, the final letter of "He" is E, and so by having the letter E "penetrate" the word nice you produce "niece", a person who is a relative.

    [ and yes, Private Eye is a satirical magazine, the crossword clues are, likewise, intended to make you a little uncomfortable while you laugh ]

    Secondly, British crosswords are arranged with black "dead" squares between letters to produce more of a lattice, in which many letters only take part in one word. As a result, longer answers are common.

    e.g. same crossword, clue 26 across is

    "Figure on getting your teeth into our statistical revelations (6,9)"

    The answer was "Number Crunching".

    • jen729w 2 years ago

      Brit here. I woke up one morning – I was 15, so this was in the 90s – with the word ‘microdot’ in my head. The first thought, clear as anything, as if it was painted across the inside of my eyes. Microdot!

      Puzzled, I didn’t move and set about figuring out why. Eventually I realised that I had solved, in my sleep, a crossword clue that I had not even gone to bed thinking about. I’d read it at my grandma’s house earlier the previous day.

      Tiny picture makes computer work on time (8)

      The brain is amazing. I’m not even any good at the cryptic crossword!

      • schoen 2 years ago

        For non-cryptic solvers, this clue is parsed as follows:

        Definition: Tiny picture

        Wordplay: Computer [MICRO] + work [DO] + time [T]
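
        Mechanically, both this clue and the NIECE one upthread reduce to simple string operations - the hard part is finding the parse, not applying it. A toy illustration in Python:

            def charade(*parts):
                # Concatenation wordplay: MICRO + DO + T -> MICRODOT.
                return "".join(parts)

            def insertion(inner, outer, pos):
                # Insertion wordplay: E "penetrates" NICE to give NIECE.
                return outer[:pos] + inner + outer[pos:]

            assert charade("MICRO", "DO", "T") == "MICRODOT"
            assert insertion("E", "NICE", 2) == "NIECE"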

    • schoen 2 years ago

      I like solving both American-style and cryptic crosswords now, but I never realized that the British called the black squares "dead squares" in English. In my experience, this term is never used in American English, but it's the term most often used in Brazilian Portuguese crosswords (casas mortas). To my knowledge, it's also not used in Spanish (cuadros negros/celdas negras), French (cases noires), or Italian (caselle nere).

      I wonder how the dead-square terminology reached Brazil. I think the popularity of crosswords there was originally due to Italian immigrants, who might not have said "dead squares".

      I've heard people call the "letters [that] only take part in one word" unches (an abbreviation for "unchecked squares").

      • tialaramex 2 years ago

        I don't know if it would be usual to call these "dead" here; that was just the obvious word for them to me, as someone with relatively little crossword ability, and I didn't ask an expert. I lack whatever spark it is that allows several of my friends to do cryptics with great success. In Private Eye, for example, once it enters my mind that somehow the answer might be the slightly childish "BOOBS", it might take me days to realise "BOSOM" fits the clue much better - and aha, now I can solve an adjacent clue because that fourth letter is an O, not a B...

    • lern_too_spel 2 years ago

      Why do you use only the last letter in "he"?

      • agd 2 years ago

        The relevant clue there is ‘he finally’. The ‘finally’ is the part that hints at only taking the last letter of ‘he’.

        Cryptic crosswords in the UK often have hidden prepositions and clues like this.

    • hackernewds 2 years ago

      Sorry, that solve for "niece" made no sense to me. Why pick the letter E? Am I daft, or is it more a function of practice on these?

      • thom 2 years ago

        “He finally” means the final letter of “he”. “Agreeable” means nice. “Penetrate” means put the E in nice. “Person (relatively)” means a relative, i.e. niece. Clues are basically always partitioned into a definition of the thing and a way to construct the thing. When the two agree, you can be confident you’ve got the right answer.

  • dane-pgp 2 years ago

    I'm reminded of an article I read about an AI that competed in a crossword competition; one particularly difficult clue it faced was "Apollo 11 and 12 [180 degrees]". I don't know if it would be allowed as part of a cryptic crossword, but the numbers of letters in the (words of the) answer were 8, 4.

    The answer to that clue is included here:

    https://www.uh.edu/engines/epi2783.htm

    • nilstycho 2 years ago

      That would usually be considered an invalid cryptic clue.

interestica 2 years ago

If you want to use two brains, 'partner mode' for the New Yorker's crossword has been great. Two people can work on a single puzzle in real time.

There may be other sites that allow it - the software seems to power a few different crossword sites (with certain features enabled/disabled).

zwieback 2 years ago

My dad was a big crossword puzzler. I asked him whether, if you pick one of two possible answers to the first clue, it would be possible to solve the entire puzzle one way or the other. He sat down and created a series of puzzles with "themes", e.g. "north"/"south" or "Schiller"/"Goethe", where all the major words were from one theme or the other.

Anyway, it would be interesting to see what the AI would do with this: would there be two hotspots in the solution space, one for each variant?

  • mcherm 2 years ago

    Also, famously, the November 5, 1996 NYT puzzle, where a clue about the newly elected president could be solved as either CLINTON or BOBDOLE, and all the crossing words had two solutions.

    If they trained the AI on the NYT archive, then they would have the results of testing it on this one.

    • sp332 2 years ago

      It's been done a few times since then. February 6th this year had STAR TREK/STAR WARS across the center.

    • zwieback 2 years ago

    Cool, I'll have to tell my dad about this before it's too late. His versions were from many years before this, but published in Germany, so maybe totally independent.

  • evanb 2 years ago

    I do the puzzle every day. I've been collecting clues that have caused me trouble, i.e. led me to a wrong answer. I hope someday to be able to construct such a puzzle, with one set of clues and two complete but incompatible solutions.

    A thesaurus will get you far, but will never get you OREOCOOKIE and CHESSBOARD as answers for "It's all black and white" (from today's puzzle).

cinntaile 2 years ago

Now automatically send in the answers to the various weekly magazine and newspaper competitions to get a passive prize income.

r0b05 2 years ago

Mind blowing stuff.

mnd999 2 years ago

Anyone else getting a bit bored with all these "AI does some super-specialised task better than humans after enormous amounts of training" stories? It's not very interesting anymore.

Sure, it can do crosswords well, but the average human who does crosswords well can also do a zillion other things, and this type of AI is not getting us any closer to that.

  • DantesKite 2 years ago

    If you skim the paper, you'll see that the most interesting part is the new techniques they developed to accomplish this, advancing the field of machine learning in the process.

    • JHonaker 2 years ago

      Constraint propagation and search are hardly new techniques; in fact, they both date from the previous "AI winter" or earlier. It's a shame they've been left out of so many of the deep learning generation's wheelhouses - they were considered fundamentals by the field a very short time ago.

      Also, I'm not knocking this paper at all. I think it's a great applied paper! Stitching together techniques to actually do something is a Herculean task. Hell, it's mostly what I do too.
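
      For anyone who hasn't seen it, the core idea of constraint propagation fits in a few lines. Here's a toy arc-consistency-style pruner over crossword slots (illustrative only; as I understand it, the paper's system propagates soft candidate probabilities rather than hard sets like this):

          def prune(candidates, crossings):
              # candidates: {slot: set of candidate words}
              # crossings: (slot_a, i, slot_b, j) constraints, meaning slot_a's
              # i-th letter must equal slot_b's j-th letter (a shared square).
              changed = True
              while changed:
                  changed = False
                  for a, i, b, j in crossings:
                      for word in list(candidates[a]):
                          # Drop a word if no crossing candidate agrees on the letter.
                          if not any(word[i] == other[j] for other in candidates[b]):
                              candidates[a].discard(word)
                              changed = True
              return candidates

          cands = {"1A": {"NICE", "NEAT"}, "1D": {"EAST", "NOUN"}}
          # The first letters of 1-Across and 1-Down share a square.
          prune(cands, [("1A", 0, "1D", 0), ("1D", 0, "1A", 0)])
          # -> 1A keeps NICE and NEAT (both start with N); 1D keeps only NOUN.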

  • kromem 2 years ago

    Do you have any idea just how specialized the human brain is?

    I can just imagine, if evolution were a spectator event, people commenting: "Broca's area just regulates breathing. And that Wernicke's area is just pattern recognition in sounds. Those aren't going to get us to anything important."

    Point me to actual large generalized models in nature that aren't composed of smaller specialized functions and you might have a leg to stand on.

    (Oh wait, no, those legs things are pretty specialized too, and each have their own specialized parts. Bad analogy.)

    Well, good luck identifying an example of complex generalization without subspecialties!

  • joshcryer 2 years ago

    But every specialized model like this is getting us closer to "doing a zillion other things." By that logic, it is exactly one step closer. The general AI agent will be composed of many such models.

    • PeterisP 2 years ago

      > The general AI agent will be composed of many such models.

      Not necessarily - there is good evidence that a single model can work for many tasks; the recent Gato system https://www.deepmind.com/publications/a-generalist-agent is a good example. It's just that we usually don't do that: for most practical purposes we want an agent for a specific purpose, and for most research experiments we want a simpler setup that isolates some factor, so we usually train single-task models and don't try to make general systems.

      • joshcryer 2 years ago

        Gato is a model of models - if you read the paper, they trained it on multiple datasets - and it's actually what I was referring to in my comment.

        • PeterisP 2 years ago

          Gato is not a model of models. Of course they trained it on multiple datasets, but the result is a single model that shares weights across the tasks (the key difference between single-model and model-of-models architectures), so it is able to transfer knowledge from one dataset to the others, and it also generalizes to unseen tasks on which Gato was not trained at all.

  • flafla2 2 years ago

    > this type of AI is not getting us any closer to that.

    That is not obvious at all.