我正在尝试学习Haskell,在关于Markov文本链的reddit文章之后,我决定首先在Python中实现Markov文本生成,现在在Haskell中实现。但是我注意到我的python实现比Haskell版本更快,甚至Haskell编译为本机代码。我想知道我应该做些什么来使Haskell代码运行得更快,而且现在我认为它因为使用Data.Map而不是hashmaps而慢得多,但我不确定
我也会发布Python代码和Haskell。使用相同的数据,Python需要大约3秒钟,Haskell接近16秒。
不言而喻,我会接受任何建设性的批评:)。
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
while len(sentence) < words or sentence[-1] is not ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
-
module Markov
( train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
let l = train (y:z:xs)
in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
答案 0 :(得分:11)
a)你是如何编译它的? (ghc -O2?)
b)哪个版本的GHC?
c)Data.Map非常有效,但你可以被欺骗进入延迟更新 - 使用insertWith',而不是insertWithKey。
d)不要将bytestrings转换为String。将它们保存为字节串,并将它们存储在Map
中答案 1 :(得分:9)
Data.Map
的设计假设类Ord
比较需要恒定的时间。对于字符串键,情况可能并非如此 - 当字符串相等时,情况绝对不是这样。您可能会也可能不会遇到此问题,具体取决于您的语料库的大小以及有多少单词具有共同的前缀。
我很想尝试一种旨在使用序列键操作的数据结构,例如bytestring-trie
所建议的Don Stewart包。
答案 2 :(得分:7)
我尽量避免做任何花哨或微妙的事情。这只是进行分组的两种方法;第一个强调模式匹配,第二个强调不匹配。
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
where go (x:y:[]) m = m
go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1
addWord (Just m') = Just $ M.alter inc z m'
inc Nothing = Just 1
inc (Just cnt) = Just $ cnt + 1
in go (y:z:xs) $ M.alter addWord (x,y) m
train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.alter (addWord z) (x,y) m
addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
inc = Just . maybe 1 (+1)
main = do contents <- B.readFile "76.txt"
let db = train3 $ B.words contents
print $ "Built a DB of " ++ show (M.size db) ++ " words"
我认为它们都比原版快,但不可否认,我只是针对我找到的第一个合理的语料库尝试过它们。
EDIT 根据Travis Brown的非常有效的观点,
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
inc k _ = M.insertWith (+) k 1
答案 3 :(得分:3)
这是基于foldl'
的版本,其速度似乎是train
的两倍:
train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
where
f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
我在Project Gutenberg Huckleberry Finn(我假设是你的76.txt
)上尝试过它,它产生与你的函数相同的输出。我的时间比较是非常不科学的,但这种方法可能值得一看。
答案 4 :(得分:2)
1)我不清楚你的代码。 a)您定义“狐狸”但不使用它。你是否意味着我们试图帮助你使用“狐狸”而不是阅读文件? b)你将其声明为“模块Markov”,然后在模块中有一个“main”。 c)不需要System.Random。如果您在发布之前清理代码,它确实可以帮助我们。
2)使用ByteStrings和Don说的一些严格的操作。
3)使用-O2编译并使用-fforce-recomp以确保您实际重新编译了代码。
4)尝试这种轻微的转换,它的工作速度非常快(0.005秒)。显然输入非常小,所以你需要提供你的文件或者自己测试一下。
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train xs = go xs M.empty
where
go :: [B.ByteString] -> Database -> Database
go (x:y:[]) !m = m
go (x:y:z:xs) !m =
let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
in go (y:z:xs) m'
main = print $ train $ B.words fox
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
答案 5 :(得分:1)
正如Don所建议的那样,请考虑使用函数的更严格版本:insertWithKey'(和M.insertWith',因为你无论如何都忽略了密钥参数)。
看起来您的代码可能会累积大量的内容,直到它到达[String]
的末尾。
退房:http://book.realworldhaskell.org/read/profiling-and-optimization.html
...特别是尝试绘制堆图(大约在本章的一半)。有兴趣了解你的想法。