The Zen of Spell-Checking
To a spell-checking program, "boatman" is as good as "Obama."
The source: “Who Checks the Spell-Checkers?” by Chris Wilson, in Slate, Dec. 31, 2008.
Even the cockiest grammarian can be intimidated by the wavy red underline that signals a misspelled word in most word processing programs. But when Microsoft Word’s spell-check routinely suggested that future president Barack Obama’s last name be “corrected” to “Boatman” well into 2007, it made the widely used software program seem ridiculous.
Spell-checking doesn’t need to be so backward, writes Chris Wilson, an assistant editor at Slate. All the technology needed to produce a timely spelling database already exists in search engines such as Google and Microsoft’s own Live Search. Part of the reason for the disparity between the nimbleness of Google and the torpor of Microsoft Word’s spell-check—and even that of Google’s online word processor Google Docs—is that word processors and search engines try to do different things. Search engines tackle inquiries as broad as human curiosity; word processors are conservative, limiting their lexicons to words that are strictly kosher.
The two technologies update their dictionaries differently, Wilson says. Ten years ago, word processor spelling lists were compiled from web pages or old Internet queries and scrutinized by human editors in software companies. Now, Microsoft keeps on top of change by scanning trillions of words in e-mail messages sent through its Hotmail service, gleaning such terms as “Netflix,” “Radiohead,” “Lipitor,” and “all-nighter.” But its spell-checker, still overseen by relatively slow-moving humans, makes surprising errors.
Google automates its word harvesting, trolling the Web to discover new words that show up with “any appreciable frequency.” Wilson found that Google offered alternate spellings for a word after it appeared only a small number of times, and was able to correct several misspellings of the unusual word “theothanatology”—the study of the death of God—when it had appeared online only 829 times.
A word is spelled correctly more often than not, so frequency of its usage is Google’s first cut for correctness. The best algorithms can identify a mistake—and suggest a cure—even when each word is spelled correctly but the context is wrong. Typing “golf war” into a Google search box returns some results for “Gulf war” as well, Wilson notes. The method does have its pitfalls, though. If it were used as a spell-checker, more naughty words might make it through; plus, a few instances of “Dalmation” (coast or dog) might turn up because the incorrect spelling with an o is almost as common as the correct “Dalmatian.”
But it would produce much better results than the primary “edit distance” method used by most word processors. That method offers corrections by changing the fewest number of letters, which is how it arrives at “boatman” for “Obama.”
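For readers curious how the two approaches differ in practice, here is a minimal Python sketch. It is an illustration only, not the code used by Microsoft Word or Google: the function names (edit_distance, closest_word, most_frequent_spelling), the toy lexicon, and the usage counts are all hypothetical, and the distance shown is the standard Levenshtein measure, which is assumed here as one common form of "edit distance."

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-letter insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the letter
            ))
        previous = current
    return previous[-1]


def closest_word(word: str, lexicon: list[str]) -> str:
    """Edit-distance approach: suggest the lexicon entry that needs
    the fewest letter changes, ignoring case."""
    return min(lexicon, key=lambda entry: edit_distance(word.lower(), entry.lower()))


def most_frequent_spelling(variants: list[str], counts: dict[str, int]) -> str:
    """Frequency approach: prefer whichever spelling is seen most often."""
    return max(variants, key=lambda variant: counts.get(variant, 0))


if __name__ == "__main__":
    # Hypothetical lexicon that has not yet learned the name "Obama".
    toy_lexicon = ["boatman", "spelling", "dictionary"]
    print(closest_word("Obama", toy_lexicon))

    # Hypothetical usage counts, invented for illustration.
    toy_counts = {"Dalmatian": 120, "Dalmation": 95}
    print(most_frequent_spelling(["Dalmation", "Dalmatian"], toy_counts))
```

In this sketch, "obama" and "boatman" are four letter changes apart, so a lexicon that lacks the name can only offer the nearest word it already knows. The frequency function shows the opposite trade-off: it picks up new words quickly, but a common misspelling with counts close to the correct form, as with "Dalmation," could win if the numbers tipped the other way.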