I want to create a histogram of the 20 most frequent words in a text. This is what I have so far:
import Data.List(sort, group, sortBy)
toWordList = words
countCommonWords wordList = length (filter isCommon wordList)
    where isCommon word = elem word commonWords
dropCommonWords wordList = filter isUncommon wordList
    where isUncommon w = notElem w commonWords
commonWords = ["the","and","have","not","as","be","a","I","on", "you","to","in","it","with","do","of","that","for","he","at"]
countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList
compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1 > n2 then GT else EQ
sortWords wordList = reverse $ sortBy compareTuples wordList
toAsteriskBar x = (replicate (snd x) '*') ++ " -> " ++ (fst x) ++ "\n"
makeHistogram wordList = concat $ map toAsteriskBar (take 20 wordList)
--Do word list
text = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way--in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. there were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever of."
main = do
    let wordlist = toWordList text
    putStrLn "Report:"
    putStrLn ("\t" ++ (show $ length wordlist) ++ " words")
    putStrLn ("\t" ++ (show $ countCommonWords wordlist) ++ " common words")
    putStrLn "\nHistogram of the most frequent words (excluding common words):\n"
    putStr $ makeHistogram $ sortWords $ countWords $ dropCommonWords $ wordlist
Output:
Report:
186 words
71 common words
Histogram of the most frequent words (excluding common words):
************ -> was
***** -> were
**** -> we
** -> us,
** -> times,
** -> throne
** -> there
** -> season
** -> queen
** -> large
** -> king
** -> jaw
** -> its
** -> had
** -> going
** -> face,
** -> epoch
** -> direct
** -> before
** -> all
Does anyone know why the counter treats a word with trailing punctuation, e.g. us, (with the comma), as a whole word?
Answer 0 (score: 1)
The short version
toWordList = words
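This is the function I would change to clean up your text. For example,
toWordList = map (filter isAlpha) . words
keeps only the alphabetic characters of each word, instead of every whitespace-delimited chunk of characters (which is what words returns). That is why us, shows up with its comma: words splits on whitespace only. Edit: isAlpha comes from the Data.Char module, which you need to import. I also edited the snippet above to add the missing map.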
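To make the suggestion concrete, here is a minimal sketch of the cleaned-up tokenizer. The extra filter (not . null) step is my own assumption, not part of the snippet above: it drops chunks that consisted only of punctuation and would otherwise become empty strings.

```haskell
import Data.Char (isAlpha)

-- Split on whitespace, then keep only the letters of each chunk.
-- Assumption (not in the original snippet): chunks that end up empty,
-- e.g. a standalone "--", are discarded.
toWordList :: String -> [String]
toWordList = filter (not . null) . map (filter isAlpha) . words
```

With this version, toWordList "before us, we had" yields ["before","us","we","had"]. Note that a chunk such as way--in collapses into wayin, and that It and it are still counted separately; if that matters, you would need to split on punctuation and normalize case yourself.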
The long version
Moving on, I'm just going to comment on the rest of the code, because why not.
import Data.List(sort, group, sortBy)
Yes, reuse existing code. You may also want comparing from Data.Ord.
countCommonWords wordList = length (filter isCommon wordList)
    where isCommon word = elem word commonWords
dropCommonWords wordList = filter isUncommon wordList
    where isUncommon w = notElem w commonWords
These operations are O(n * m), where n is the length of wordList and m is the length of commonWords. If you need to, you can use a Set to speed this up.
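As a rough sketch of the Set idea, assuming the containers package (it ships with GHC): the name commonWordSet is mine, everything else mirrors the functions above. Each membership test becomes O(log m) instead of O(m).

```haskell
import qualified Data.Set as Set

commonWords :: [String]
commonWords =
    [ "the", "and", "have", "not", "as", "be", "a", "I", "on", "you"
    , "to", "in", "it", "with", "do", "of", "that", "for", "he", "at" ]

-- Hypothetical helper: the common words as a Set, built once.
commonWordSet :: Set.Set String
commonWordSet = Set.fromList commonWords

-- Count the words that appear in the common-word set.
countCommonWords :: [String] -> Int
countCommonWords = length . filter (`Set.member` commonWordSet)

-- Keep only the words that are not in the common-word set.
dropCommonWords :: [String] -> [String]
dropCommonWords = filter (`Set.notMember` commonWordSet)
```

For a list of 20 common words the difference is negligible; it only starts to matter if the stop-word list grows.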
commonWords = ["the","and","have","not","as","be","a","I"
              ,"on","you","to","in","it","with","do","of","that"
              ,"for","he","at"]
countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList
A similar performance remark applies here. A common approach is to use Data.Map.insertWith to keep a counter for each word.
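One possible shape for the Map-based counter, as a sketch: I am assuming the strict variant Data.Map.Strict (from the containers package) and a strict left fold, neither of which the remark above prescribes. The helper name bump is made up for illustration.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Count occurrences of each word by folding the list into a Map.
countWords :: [String] -> [(String, Int)]
countWords = Map.toList . foldl' bump Map.empty
  where
    -- Insert the word with count 1, or add 1 to its existing count.
    bump counts w = Map.insertWith (+) w 1 counts
```

Map.toList returns the pairs in ascending key order, so the result has the same shape as the sort/group version: countWords ["was","it","was"] gives [("it",1),("was",2)].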
compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1> n2 then GT else EQ
This is more easily spelled compareTuples = comparing snd, since the counts being compared are the second component of each tuple.
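Put together, the sorting code could look like the sketch below. sortWords behaves like the original (ascending sort, then reverse). sortWordsDesc is a hypothetical alternative of mine that sorts descending directly via Down from Data.Ord; it is not a drop-in replacement, because it orders ties differently.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Compare two (word, count) pairs by their count.
compareTuples :: (String, Int) -> (String, Int) -> Ordering
compareTuples = comparing snd

-- Same behaviour as the original: sort ascending by count, then reverse.
sortWords :: [(String, Int)] -> [(String, Int)]
sortWords = reverse . sortBy compareTuples

-- Hypothetical alternative: sort descending by count in one pass.
-- sortBy is stable, so words with equal counts keep their input order,
-- whereas sortWords reverses the order of ties.
sortWordsDesc :: [(String, Int)] -> [(String, Int)]
sortWordsDesc = sortBy (comparing (Down . snd))
```

For [("a",1),("b",3),("c",1)], sortWords gives [("b",3),("c",1),("a",1)] while sortWordsDesc gives [("b",3),("a",1),("c",1)].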