An improved spelling corrector in Haskell
November 01, 2009 at 11:09 AM | categories: Uncategorized | View CommentsMy previous post came up on the Haskell Reddit yesterday, and I got some great suggestions for improvements to the program, which I've checked into GitHub.
-
Map.fromListWith
replacesList.foldl'
andMap.insertWith'
.fromListWith
appears to be specifically designed for aggregating values that share a common key (in .NET terminology it's theGroupBy
function). -
I'll commonly use a construct like
readFile "big.txt" >>= return . train . lowerWords
, to bind a monad valuereadFile "big.txt"
to a function(train . lowerWords)
, then produce a new monad value withreturn
. A shorter way to write this isfmap (train . lowerWords) (readFile "big.txt")
, or, with an import from Control.Applicative,(train . lowerWords <$> readFile "big.txt")
. -
You can replace the lambda syntax
(\w -> Map.lookup w nwords)
with an operator section,('Map.lookup' nwords)
. You might see this more commonly as(+1)
in place of(\x -> x + 1)
; it's the same thing. Edit: the challenging Markdown syntax means that you'll have to imagine the single quotes aroundMap.lookup
replaced with backticks, `.
I should have realised this at the time, but it didn't occur to me that you can use full pattern-matching syntax in the right-hand side of a Haskell list comprehension. If the match fails on some element then that element is filtered out. We can use this to improve on the Python version of the edits1
function: whereas the Python version combines its b[0]
subscripts with an if b
check, the Haskell version can use pattern matching to do both. We can also use the inits
and tails
functions to build substrings instead of brute-force applications of take
and drop
.
# Python def edits1(word): s = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[1:] for a, b in s if b] transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1] replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] inserts = [a + c + b for a, b in s for c in alphabet] return set(deletes + transposes + replaces + inserts) > -- Haskell > edits1 word = let s = zip (inits word) (tails word) > deletes = [ a ++ y | (a, _:y ) <- s ] > transposes = [ a ++ y:x:z | (a, x:y:z) <- s ] > replaces = [ a ++ c:y | (a, _:y ) <- s, c <- alphabet ] > inserts = [ a ++ c:x | (a, x ) <- s, c <- alphabet ] > in Set.fromList $ concat [ deletes, transposes, replaces, inserts ]
The edits1
function is the densest one in both the Python and Haskell versions. It's the one where the intent of the code was least clear to me when I originally saw it, probably because of the extra clutter in the list comprehensions. In the latest Haskell revision it's more obvious what's going on.
I've consistently used the names x
, y
and z
for the first three parts of the word's suffix. Because they consistently stand for the same thing, any inconsistencies jump out at you when you read the code:
a ++ y
stands out becausex
is missing; it's been deleted. (x
isn't even mentioned in the pattern match.)a ++ y:x:z
stands out becausex
andy
are the wrong way round; we're transposing thema ++ c:y
stands out because we've gotc
instead ofx
;x
has been replaced. (Again,x
has dropped out of the pattern match.)a ++ c:x
hasc
in front ofx
; it's been inserted
After I'd written the first Haskell version I was skeptical. The Python version uses some typical Python idioms, like if b
for checking for empty strings, using the same syntax for accessing lists and sets, and overloading the or
operator to look for the first non-empty set, and Haskell doesn't have these same shortcuts. However the latest set of revisions makes the the core algorithm in the Haskell version more readable than the Python equivalent.