Question

I have a file containing a matrix of numbers, as follows:

0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
.   .
.     .

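What I want to do is sum each row and write the sums to a new file, one row's sum per line.

I've tried doing this in Haskell using ByteStrings, but it runs 3 times slower than the Python implementation. Here is the Haskell implementation: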

import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f)) 

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
  print "complete"

This takes about 14 seconds for a 25 MB file.

Here is the Python implementation:

fd = open("telematrix", "r")
nfd = open("teleDensity", "w")

for line in fd: 
  nfd.write(str(sum(map(int, line.split(" ")))) + "\n")

fd.close()
nfd.close()

This takes about 5 seconds for the same 25 MB file.

Any suggestions on how to speed up the Haskell implementation?

Answer 1

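It seems the problem was that I was using runhaskell to compile and run the program, rather than compiling it with ghc and then running the resulting executable. By compiling first and then running, I got the Haskell version down to 1 second.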

Answer 2

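At first glance, I'd bet your first bottleneck is the ++ on strings in sumfile, which deconstructs the left operand and rebuilds it every time. Instead of appending "\n" to the end of each element, you could replace the unwords call with unlines, which does exactly what you want here. That should give you a nice speed boost.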
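
A minimal sketch of that change, assuming the rest of the question's program stays as it is. sumrows is copied from the question; only sumfile and the final write differ. Using B.lines instead of B.split '\n' is my own substitution, not something this answer asks for. Note also that unwords joins its arguments with spaces, so the original output has a stray leading space on every line after the first; unlines avoids that.

```haskell
import qualified Data.ByteString.Char8 as B

-- Row summation, unchanged from the question.
sumrows :: B.ByteString -> Int
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- One sum per input line. unlines adds the "\n" terminators itself,
-- so there is no per-element (++ "\n") and no unwords.
-- B.lines is an assumption of this sketch (the question uses B.split '\n').
sumfile :: B.ByteString -> String
sumfile f = unlines (map (show . sumrows) (B.lines f))

main :: IO ()
main = do
  contents <- B.readFile "telematrix"
  B.writeFile "teleDensity" (B.pack (sumfile contents))
  print "complete"
```

Whether this makes a measurable difference on a 25 MB file is worth timing rather than assuming.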

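A more minor nitpick is that the (*1) in the maybe call is unnecessary. Using id there would be more efficient, since the multiplication is wasted work, though it costs no more than a few processor cycles.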
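
As a small illustration of this point, the field parser can drop both the (*1) and the fmap by passing fst to maybe directly. parseField is a hypothetical helper name introduced here; it does not appear in the question.

```haskell
import qualified Data.ByteString.Char8 as B

-- Question's form:  maybe 0 (*1) $ fst <$> B.readInt x
-- With id:          maybe 0 id   $ fst <$> B.readInt x
-- Equivalent, with fst applied inside maybe (parseField is a hypothetical name):
parseField :: B.ByteString -> Int
parseField = maybe 0 fst . B.readInt

main :: IO ()
main = print (map (parseField . B.pack) ["42", "-7", "..."])
-- prints [42,-7,0]
```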

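Finally, I have to ask why you're using ByteString here. ByteString stores string data efficiently as an array, like traditional strings in more imperative languages. However, what you're doing here involves splitting the string and iterating over the pieces, which are operations a linked list is well suited to. Honestly, I'd recommend using the traditional [Char] type in this case. That B.split call could be hurting you, because it has to take the whole line and copy it into separate arrays in split form, whereas the words function on a linked list of characters just splits the linked structure off at a few points.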
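
For comparison, here is a rough sketch of what a [Char]-based version might look like, using lines and words from the Prelude. The names parseField and sumRow are hypothetical, and treating unparsable fields as 0 (via readMaybe) is an assumption chosen to mirror the question's maybe 0 behaviour. Nothing in this thread measures whether it is faster than the ByteString version, so it is best benchmarked rather than taken on faith.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Parse one field; anything that is not an Int counts as 0 (assumption).
parseField :: String -> Int
parseField = fromMaybe 0 . readMaybe

-- words splits on whitespace, so runs of spaces are tolerated.
sumRow :: String -> Int
sumRow = sum . map parseField . words

main :: IO ()
main = do
  contents <- readFile "telematrix"
  writeFile "teleDensity" (unlines (map (show . sumRow) (lines contents)))
```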

Answer 3

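The main reason for the poor performance was that I was using runhaskell, which interprets the program, instead of compiling it first and then running it. So I switched from: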

runhaskell program.hs

to:

ghc program.hs

./program
