I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try using more than 1 core, the program slows down considerably.
For example, a while ago I tried to compute histograms (and, from them, the unique words) from ~700 documents. I figured that file-level granularity would be fine. With -N4 I get a work balance of 1.70, but with -N1 it runs in half the time it takes with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize and gain some understanding of it. How would this be parallelized so that the speed increases with the number of cores rather than decreases?
import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')
mapReduce stratm m stratr r xs =
  let mapped  = parMap stratm m xs
      reduced = r mapped `using` stratr
  in  mapped `pseq` reduced
type Histogram = Map Text Int
rootDir = "/home/masse/Documents/text_conversion/"
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)
textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "."  = True
        meta ".." = True
        meta _    = False
histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words
wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where
    reduce = M.unions
main = do
  list <- wordList
  print $ M.size list
As for the text files, I'm using PDFs converted to text files, so I can't provide them, but for this purpose almost any book(s) from Project Gutenberg should do.
Edit: added the script's imports.
Answer 0 (score: 4)
I think Daniel got it right: Data.Map and lists are lazy data structures; you should use both foldl' and insertWith' to make sure the work for each chunk is done eagerly, otherwise all the work is delayed until the sequential part (reduce).
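As a sketch (using the question's existing imports), an eager version of histogram could look like this; foldl' forces the accumulator at every step and insertWith' forces the combined count, so each per-file map is fully built inside its own spark:

-- a strict rewrite of the question's histogram function
histogram :: Text -> Histogram
histogram = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty
          . filter (not . isStopWord) . T.words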
It's also not obvious that making one spark per file is the right granularity, particularly if file sizes differ substantially. If that is the case, it would be preferable to concatenate the word lists and split them into even-sized chunks (see the parListChunk combinator); a sketch of this follows.
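A minimal sketch of the even-chunk idea, assuming all the documents' words are concatenated into one list first; chunksOf is a hypothetical local helper and the chunk size of 10000 is an arbitrary guess, not a number from the answer:

histogramChunked :: [Text] -> Histogram
histogramChunked ws = M.unionsWith (+) perChunk
  where
    -- one spark per even-sized chunk, each chunk folded eagerly
    perChunk = map mkHist (chunksOf 10000 ws) `using` parList rseq
    mkHist   = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty
    chunksOf _ [] = []
    chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

The chunking is explicit here for clarity; parListChunk achieves a similar effect by grouping the list elements themselves into per-spark chunks.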
While you're at it, I would also look into the pitfalls of using lazy IO (readFile) to open many files (the runtime system might run out of file handles because it holds on to them for too long).
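One common pattern to avoid that (a sketch, not from the answer) is to scope each handle with withFile and read strictly, so the handle is closed as soon as the contents are in memory; Data.Text.IO.hGetContents reads the whole file strictly:

import System.IO (withFile, IOMode(..))

readDocument :: FilePath -> IO Text
readDocument fp = withFile fp ReadMode TI.hGetContents
-- the handle is closed when withFile returns, not when the text is consumed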
Answer 1 (score: 4)
In practice, getting the parallel combinators to scale well can be difficult. Others have mentioned making your code more strict to ensure you are actually doing the work in parallel, and that is definitely important.
Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversals put more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type, or, even better, the HashSet type from the unordered-containers package (since repeated string comparisons can be expensive, especially when they share common prefixes).
import Data.HashSet (HashSet)
import qualified Data.HashSet as S
...
finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)
isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord
Replacing the isStopWord function with this version performs much better and scales much better (though definitely not 1-1). You could also consider using a HashMap (from the same package) instead of a Map for the same reason, but I did not get a noticeable change from doing so; a sketch of that swap is below.
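For reference, a minimal sketch of that swap (hypothetical, and the answer reports no noticeable change from it); Data.HashMap.Strict's insertWith is already strict in the values, and keys need a Hashable instance instead of Ord:

import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

type Histogram = HashMap Text Int

histogram :: Text -> Histogram
histogram = foldl' (\m k -> HM.insertWith (+) k 1 m) HM.empty
          . filter (not . isStopWord) . T.words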
Another option is to increase the default heap size to take some pressure off the GC and give it more room to move things around. Giving the compiled code a default heap size of 1GB (the -H1G flag), I get a GC balance of about 50% on 4 cores, whereas I only get ~25% without it (and it also runs ~30% faster).
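For reference, a typical invocation with those flags might look like this (assuming the program is compiled with -threaded and -rtsopts so that RTS options are accepted; -s prints the GC and productivity statistics quoted here):

ghc -O2 -threaded -rtsopts histogram.hs
./histogram +RTS -N4 -H1G -s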
With these two changes, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably there is still room for improvement based on the GC statistics (it still spends only 58% of the time doing productive work), but doing significantly better might require a much more drastic change to your algorithm.