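Histogram counting words

Time: 2019-01-11 16:37:50

Tags: haskell

I want to create a histogram of the most frequent words in a text, leaving out the 20 most common words in the world. This is what I have so far: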

import Data.List(sort, group, sortBy)
toWordList = words
countCommonWords wordList = length (filter isCommon wordList)
  where isCommon word = elem word commonWords

dropCommonWords wordList = filter isUncommon wordList
  where isUncommon w = notElem w commonWords


commonWords = ["the","and","have","not","as","be","a","I","on", "you","to","in","it","with","do","of","that","for","he","at"]
countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList
compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1 > n2 then GT else EQ

sortWords wordList = reverse $ sortBy compareTuples wordList

toAsteriskBar x = (replicate (snd x) '*') ++ " -> " ++ (fst x) ++ "\n"
makeHistogram wordList = concat $ map toAsteriskBar (take 20 wordList)


--Do word list

text = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way--in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. there were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever of."

main = do
  let wordlist = toWordList text
  putStrLn "Report:"
  putStrLn ("\t" ++ (show $ length wordlist) ++ " words")
  putStrLn ("\t" ++ (show $ countCommonWords wordlist) ++ " common words")
  putStrLn "\nHistogram of the most frequent words (excluding common words):\n"
  putStr $ makeHistogram $ sortWords $ countWords $ dropCommonWords  $ wordlist

Result:


Report:
    186 words
    71 common words
Histogram of the most frequent words (excluding common words):
************ -> was
***** -> were
**** -> we
** -> us,
** -> times,
** -> throne
** -> there
** -> season
** -> queen
** -> large
** -> king
** -> jaw
** -> its
** -> had
** -> going
** -> face,
** -> epoch
** -> direct
** -> before
** -> all

Does anyone know why the counter counts words together with their trailing comma, for example us, as a whole word?

1 answer:

Answer 0 (score: 1)

The short version

toWordList = words

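This is the function I would modify to clean up your text. For example, toWordList = map (filter isAlpha) . words keeps only the alphabetic characters of each word, instead of every whitespace-separated chunk of characters (which is all the words function gives you). Edit: isAlpha comes from the Data.Char module, which you need to import. I also edited the snippet above to add the map.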
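A minimal sketch of that change, assuming the rest of the program stays as it is. The extra filter (not . null) step is an addition of this sketch, not part of the suggestion above: it drops tokens that become empty once the punctuation is removed.

```haskell
import Data.Char (isAlpha)

-- Split on whitespace, then keep only the alphabetic characters of each token.
-- Dropping empty tokens is an assumption added here; a token made purely of
-- punctuation (such as "--") would otherwise become the empty string.
toWordList :: String -> [String]
toWordList = filter (not . null) . map (filter isAlpha) . words
```

With this version "us," and "us" count as the same word. Note that a token such as "way--in" turns into "wayin", because the characters are filtered inside each token rather than used as separators.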

The epic version

Moving on, I'm just going to comment on the code, because why not.

import Data.List(sort, group, sortBy)

Yes, use pre-existing code. You may also want comparing from Data.Ord.

countCommonWords wordList = length (filter isCommon wordList)
  where isCommon word = elem word commonWords

dropCommonWords wordList = filter isUncommon wordList
  where isUncommon w = notElem w commonWords

These operations are O(n * m), where n is the length of wordList and m is the length of commonWords. If you need to, you can use a Set to speed this up.
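One way the Set variant could look, as a sketch only. It assumes the containers package (which ships with GHC), and the name commonWordSet is made up for this example. Each lookup is then O(log m) instead of O(m).

```haskell
import qualified Data.Set as Set

-- The same 20 common words as in the question, stored in a Set.
-- commonWordSet is a hypothetical name introduced for this sketch.
commonWordSet :: Set.Set String
commonWordSet = Set.fromList
  ["the","and","have","not","as","be","a","I","on","you"
  ,"to","in","it","with","do","of","that","for","he","at"]

-- Count the words that appear in the common-word set.
countCommonWords :: [String] -> Int
countCommonWords = length . filter (`Set.member` commonWordSet)

-- Keep only the words that are not in the common-word set.
dropCommonWords :: [String] -> [String]
dropCommonWords = filter (`Set.notMember` commonWordSet)
```

For a list of 20 words the difference is unlikely to matter in practice; the Set pays off when the list of excluded words grows.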

commonWords = ["the","and","have","not","as","be","a","I"
              ,"on","you","to","in","it","with","do","of","that"
              ,"for","he","at"]

countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList

A similar performance comment applies here. A common approach is to use Data.Map.insertWith to keep a counter for each word.
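A sketch of the counter idea, again assuming the containers package. It uses the strict variant Data.Map.Strict, which exports the same insertWith, so the counts are not built up as unevaluated thunks; that choice is an assumption of this sketch.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Walk the list once and bump a per-word counter in a Map.
-- insertWith (+) w 1 inserts 1 for a new word and adds 1 to an existing count.
countWords :: [String] -> [(String, Int)]
countWords = Map.toList . foldl' bump Map.empty
  where
    bump counts w = Map.insertWith (+) w 1 counts
```

Map.toList returns the pairs in ascending key order, which is the same order the sort and group version produces, so the rest of the pipeline should not need to change.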

compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1 > n2 then GT else EQ

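This is more easily spelled compareTuples = comparing snd, since the function compares the counts, which are the second component of each pair.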
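A sketch of the shorter spelling together with the sortWords function from the question. The hand-written version compares n1 and n2, the second components of the pairs, so the sketch uses comparing snd; the type signatures are additions for readability.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Order (word, count) pairs by their count only.
compareTuples :: (String, Int) -> (String, Int) -> Ordering
compareTuples = comparing snd

-- Highest count first, as in the question: sort ascending, then reverse.
sortWords :: [(String, Int)] -> [(String, Int)]
sortWords = reverse . sortBy compareTuples
```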