I want to extract the most frequent words from the Google N-Grams dataset, which is about 20 GB in its uncompressed form. I don't want the whole dataset, just the 5000 most common words. But if I write
take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]
it will be an endless wait. So how should I go about it?
I know there is the Data.Array.MArray package for in-place array computation, but I can't see any item-modification functions on its documentation page. There is also Data.HashTable.IO, but it is an unordered data structure.

I would like to use plain Data.IntMap.Strict (which has a convenient lookupLE function), but I don't think it would be very efficient, because it produces a new map on every modification. Could the ST monad improve on that?
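For concreteness, here is a minimal sketch of the bounded-IntMap idea I have in mind (the name topK and the choice to group words of equal frequency into lists are only illustrative; note that IM.size counts distinct frequency keys rather than words, so this is approximate when frequencies collide):

import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Sketch only: keep words keyed by frequency, evicting the current
-- minimum-frequency bucket once the map holds k distinct frequencies.
topK :: Int -> [(String, Int)] -> IM.IntMap [String]
topK k = foldl' step IM.empty
  where
    step acc (w, f)
      | IM.size acc < k          = IM.insertWith (++) f [w] acc
      | f > fst (IM.findMin acc) = IM.insertWith (++) f [w] (IM.deleteMin acc)
      | otherwise                = acc

This is the structure whose per-update allocation I am worried about.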
UPD: I have also posted the final version of the program on CoreReview.SX.
Answer 0: (score 5)
How about using splitAt to divide the dataset into the first 5000 items and the rest? The process is then effectively linear, but the constant factor improves if you use a data structure with sub-linear min-delete and insert for the sorted 5000 elements.

For example, using Data.Heap from the heap package:
import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)

mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
  where
    -- change our pairs from (String,Int) to (Int,String)
    pairs = map swap dataset
    -- get the first `n` pairs in one list, and the rest of the pairs in another
    (first, rest) = splitAt n pairs
    -- put all the first `n` pairs into a MinHeap
    start = fromList first :: MinHeap (Int, String)
    -- then run through the rest of the pairs
    stop = foldl' step start rest
    -- modifying the heap to replace its least frequent pair
    -- with the new pair if the new pair is more frequent
    step heap pair = if viewHead heap < Just pair
                       then insert pair (fromJust $ viewTail heap)
                       else heap
    -- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
    final = map swap (toList stop)
    swap ~(a,b) = (b,a)
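For illustration, here is a hypothetical driver that assumes mostFreq above is in scope and that the data has already been flattened into whitespace-separated "word count" lines in a file called ngrams.tsv (both of these are assumptions; the real N-Grams files need their own parsing):

-- Hypothetical usage of mostFreq; file name and line format are assumed.
main :: IO ()
main = do
  contents <- readFile "ngrams.tsv"
  let parse line = case words line of
        (w : c : _) -> (w, read c :: Int)
        _           -> error ("unparseable line: " ++ line)
  mapM_ print (mostFreq 5000 (map parse (lines contents)))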
Answer 1: (score 1)
Did you try this, or are you just guessing? Many of Haskell's sort functions respect laziness, and when you ask for only the top 5000 they will happily avoid sorting the rest of the elements.
Similarly, be very careful about "it produces a new map on every modification". Most insert operations on that kind of data structure are O(log n), with n bounded by 5000: a persistent map copies only the roughly log2 5000 ≈ 12 nodes on the path to the modified leaf, so you might be allocating ~30 new cells in the heap per change, but that is not a particularly huge cost, and certainly nowhere near 5000.
If Data.List.sort doesn't perform well enough, then what you want is something like this:
import Data.List (foldl')
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IM

type Freq = Int
type Count = Int

data Summarizer x = Summ {tracking :: !(IntMap [x]), least :: !Freq,
                          size :: !Count, size_of_least :: !Count}

inserting :: x -> Maybe [x] -> Maybe [x]
inserting x Nothing   = Just [x]
inserting x (Just xs) = Just (x:xs)

sizeLimit :: Summarizer x -> Summarizer x
sizeLimit skip@(Summ strs f_l tot lst)
    | tot - lst < 5000 = skip
    | otherwise        = Summ strs' f_l' tot' lst'
  where ((_, discarded), strs') = IM.deleteFindMin strs
        (f_l', new_least)       = IM.findMin strs'
        tot' = tot - length discarded
        lst' = length new_least

addEl :: (x, Freq) -> Summarizer x -> Summarizer x
addEl (str, f) skip@(Summ strs f_l tot lst)
    | f < f_l && tot >= 5000 = skip
    | otherwise              = sizeLimit $ Summ strs' f_l' tot' lst'
  where strs' = IM.alter (inserting str) f strs
        tot'  = tot + 1
        f_l'  = min f_l f
        lst'  = case compare f_l f of LT -> lst; EQ -> lst + 1; GT -> 1
Note that we store lists of strings to handle duplicate frequencies; we mostly skip updates, and when we do update, there is an O(log n) operation to put the new element in, sometimes (again, depending on duplicates) an O(log n) operation to delete the smallest elements, and an O(log n) operation to find the new smallest one.
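A hypothetical driver, not part of the answer above, showing how the pieces might be wired together: the starting value with least = maxBound and the flattening step are my own assumptions.

-- Hypothetical wiring: start from an empty summary, fold addEl over the
-- (word, frequency) stream, then flatten the IntMap back into
-- (word, frequency) pairs, most frequent first.
top5000 :: [(String, Freq)] -> [(String, Freq)]
top5000 = flatten . foldl' (flip addEl) start
  where
    start = Summ IM.empty maxBound 0 0
    flatten (Summ strs _ _ _) =
      [ (w, f) | (f, ws) <- IM.toDescList strs, w <- ws ]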