Question

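I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what is in that line.

I could do this very inefficiently by making a complete pass through the file for every list I want to collect. Example pseudocode: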

at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing

ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing

I would then map each of these functions over the entire list of lines in the file to get complete lists of atoms and sheets:

mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe ot lines
    -- Do stuff with my atoms and sheets

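However, this is inefficient because I am mapping over the entire list of strings once for every list I want to create. Instead, I want to map over the list of line strings only once, identify each line as I go, apply the appropriate function, and store the results in different lists.

My instinct is to write something like this (pseudocode):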

mapper' :: [B.ByteString] -> IO ()
mapper' lines = do
    let atoms = []
    let sheets = []
    for line in lines:
        | line == ATOM record = (atoms = atoms ++ at line)
        | line == SHEET record = (sheets = sheets ++ ot line)
    -- Now 'atoms' is a complete list of all the ATOM records
    --  and 'sheets' is a complete list of all the SHEET records

What is the Haskell way of doing this? I simply can't get my functional-programming mindset to come up with a solution.

Answer 1

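First of all, I think the answers others have supplied will work at least 95% of the time. It's always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in those cases trying to enumerate all the possibilities is difficult, time-consuming, and error-prone. Or you're writing multiple variants of the same sort of thing (manually inlining several folds into one) and you'd like to capture the abstraction.

Fortunately, there are a few techniques that can help.

The framework solution

(somewhat self-promoting)

First, the various "iteratee / enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following: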

import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe

-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function

-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)

otIter :: Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)

-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)

-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
  resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
  run resultIter

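One benefit this gives you is extra composability. You can create transformers as you like and just combine them with zip. You can even run the consumers in parallel if you like (although only if you're working in the IO monad, and it's probably not worth it unless the consumers do a lot of work) by changing to this: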

import Data.Iteratee.Parallel

parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)

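The result of doing so isn't the same as a single for loop: this will still perform multiple traversals of the data. However, the traversal pattern has changed. This will load a certain amount of data at once (defaultBufSize bytes) and traverse that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.

Hopefully this demonstrates the difference: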

Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n

Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n

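If you're doing enough work that parallelism makes sense, this isn't a problem at all. Thanks to memory locality, performance is much better than making multiple traversals over the entire input, as Data.List.zip would.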

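The beautiful solution

If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's Beautiful Folding post and Conal Elliott's follow-up work. The essential idea is that you can create data structures to represent folds and zips, and combining these gives you a new, combined fold/zip function that needs only one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.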
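
To make the idea concrete, here is a minimal sketch in the spirit of those posts, not code taken from them. The names Fold, runFold and collect are illustrative, and parseAtom and parseSheet are hypothetical stand-ins for the question's at and ot, working on String for brevity.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

-- | A left fold packaged as data: a step function, an initial
-- accumulator, and a final extraction. The accumulator type x is
-- hidden so folds with different state can be combined.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- | Strict pair used to carry two accumulators side by side.
data Pair a b = Pair !a !b

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold step (Pair beginL beginR) done
    where
      -- Each input element is fed to both folds before moving on.
      step (Pair l r) a = Pair (stepL l a) (stepR r a)
      done (Pair l r)   = doneL l (doneR r)

-- | Run a fold over a list in a single traversal.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl' step begin

-- | Collect, in input order, every result for which the parser succeeds.
collect :: (a -> Maybe b) -> Fold a [b]
collect p = Fold step [] reverse
  where
    step acc a = maybe acc (: acc) (p a)

-- Hypothetical line parsers (assumptions, not the question's real ones).
parseAtom :: String -> Maybe String
parseAtom line = case words line of
  ("ATOM" : rest) -> Just (unwords rest)
  _               -> Nothing

parseSheet :: String -> Maybe String
parseSheet line = case words line of
  ("SHEET" : rest) -> Just (unwords rest)
  _                -> Nothing

-- | Two collectors zipped into one fold that needs one pass.
atomsAndSheets :: Fold String ([String], [String])
atomsAndSheets = (,) <$> collect parseAtom <*> collect parseSheet
```

With these definitions, runFold atomsAndSheets ["ATOM C 1", "SHEET A 2", "REMARK x", "ATOM N 3"] evaluates to (["C 1","N 3"],["A 2"]), visiting each line once.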

Answer 2

I show a solution for two types of line, but it is easily extended to five types of line by using a five-tuple instead of a two-tuple.

import Data.Monoid

eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
            | isASheet bs = ([], [ {- calculate a Sheet -} ])
            | otherwise = error "eachLine"

allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines bss = mconcat (map eachLine bss)

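The magic is done by mconcat from Data.Monoid (included with GHC). It works because a tuple of monoids is itself a monoid and lists are monoids under concatenation, so the per-line pairs are combined componentwise.

(On a point of style: personally I would define a Line type and a parseLine :: B.ByteString -> Line function, and write eachLine bs = case parseLine bs of .... But this is peripheral to your question.)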
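
A rough sketch of that style suggestion follows. The Atom and Sheet newtypes, the Line constructors, and the "ATOM"/"SHEET" prefix tests are all assumptions made for illustration, since the question does not show the real record types or formats. Unlike the code above, this version skips unrecognised lines with mempty instead of calling error.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical record types; the real Atom and Sheet are not shown
-- in the question, so these simply wrap the raw line.
newtype Atom  = Atom B.ByteString  deriving (Eq, Show)
newtype Sheet = Sheet B.ByteString deriving (Eq, Show)

-- | One constructor per kind of line we care about.
data Line
  = AtomLine Atom
  | SheetLine Sheet
  | OtherLine

-- | Decide once what a line is. The prefix checks are assumed.
parseLine :: B.ByteString -> Line
parseLine bs
  | B.pack "ATOM"  `B.isPrefixOf` bs = AtomLine (Atom bs)
  | B.pack "SHEET" `B.isPrefixOf` bs = SheetLine (Sheet bs)
  | otherwise                        = OtherLine

-- | Turn a parsed line into a pair of singleton/empty lists.
eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs = case parseLine bs of
  AtomLine a  -> ([a], [])
  SheetLine s -> ([], [s])
  OtherLine   -> mempty   -- ignore anything else

-- | foldMap f is equivalent to mconcat . map f.
allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines = foldMap eachLine
```

Adding a third kind of line then means adding one constructor, one guard in parseLine, and one case alternative, and the compiler can warn about a missing alternative if incomplete-pattern warnings are enabled.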

Answer 3

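It is a good idea to introduce a new ADT, e.g. Summary, instead of tuples. Then, since you want to accumulate Summary values, you can make it an instance of Monoid (from Data.Monoid). Then classify each line with the help of classifier functions (e.g. isAtom, isSheet, etc.) and combine the results with Monoid's mconcat function (as suggested by @dave4420).

Here is the code (it uses String instead of ByteString, but that is quite easy to change):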

module Classifier where

import Data.List
import Data.Monoid

data Summary = Summary
  { atoms :: [String]
  , sheets :: [String]
  , digits :: [String]
  } deriving (Show)

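-- Since GHC 8.4, Semigroup is a superclass of Monoid, so the
-- combining operation is defined in the Semigroup instance.
instance Semigroup Summary where
  Summary as1 ss1 ds1 <> Summary as2 ss2 ds2 =
    Summary (as1 <> as2)
            (ss1 <> ss2)
            (ds1 <> ds2)

instance Monoid Summary where
  mempty = Summary [] [] []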

classify :: [String] -> Summary
classify = mconcat  . map classifyLine

classifyLine :: String -> Summary
classifyLine line
  | isAtom line  = Summary [line] [] [] -- or "mempty { atoms = [line] }"
  | isSheet line = Summary [] [line] []
  | isDigit line = Summary [] [] [line]
  | otherwise    = mempty -- or "error" if you need this  

isAtom, isSheet, isDigit :: String -> Bool
isAtom = isPrefixOf "atom"
isSheet = isPrefixOf "sheet"
isDigit = isPrefixOf "digits"

input :: [String]
input = ["atom1", "sheet1", "sheet2", "digits1"]

test :: Summary
test = classify input
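
As a hedged illustration of the "easy to change" remark, a ByteString variant might look like the sketch below. It assumes GHC 8.4 or later (Semigroup in the Prelude) and the bytestring package that ships with GHC. The helper names hasPrefix and classifyContents are made up for this example.

```haskell
import qualified Data.ByteString.Char8 as B

data Summary = Summary
  { atoms  :: [B.ByteString]
  , sheets :: [B.ByteString]
  , digits :: [B.ByteString]
  } deriving (Eq, Show)

-- Combine two summaries field by field.
instance Semigroup Summary where
  Summary a1 s1 d1 <> Summary a2 s2 d2 =
    Summary (a1 <> a2) (s1 <> s2) (d1 <> d2)

instance Monoid Summary where
  mempty = Summary [] [] []

-- | Put a line into the matching field, or nowhere if unrecognised.
classifyLine :: B.ByteString -> Summary
classifyLine line
  | hasPrefix "atom"   = mempty { atoms  = [line] }
  | hasPrefix "sheet"  = mempty { sheets = [line] }
  | hasPrefix "digits" = mempty { digits = [line] }
  | otherwise          = mempty
  where
    hasPrefix p = B.pack p `B.isPrefixOf` line

-- | Split raw contents into lines and summarise them.
classifyContents :: B.ByteString -> Summary
classifyContents = foldMap classifyLine . B.lines
```

For a real file, something like fmap classifyContents (B.readFile path) would read the contents strictly and classify every line.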

Answer 4

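If you have only 2 alternatives, using Either might be a good idea. In that case, combine your functions, map over the list, and use lefts and rights to get the results: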

import Data.Either

-- first sample function, returning String
f1 x = show $ x `div` 2

-- second sample function, returning Int
f2 x = 3*x+1

-- combined function returning Either String Int
hotpo x = if even x then Left (f1 x) else Right (f2 x)

xs = map hotpo [1..10] 
-- [Right 4,Left "1",Right 10,Left "2",Right 16,Left "3",Right 22,Left "4",Right 28,Left "5"]

lefts xs 
-- ["1","2","3","4","5"]

rights xs
-- [4,10,16,22,28]
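
Calling lefts and rights separately walks the list of Either values twice. Data.Either also exports partitionEithers, which returns both lists at once. A small sketch using the same hotpo function (splitResults is a name invented for this example):

```haskell
import Data.Either (partitionEithers)

-- | Same combined function as above, with an explicit type.
hotpo :: Int -> Either String Int
hotpo x
  | even x    = Left (show (x `div` 2))
  | otherwise = Right (3 * x + 1)

-- | Map once, then split the Lefts and Rights into two lists.
splitResults :: [Int] -> ([String], [Int])
splitResults = partitionEithers . map hotpo
```

Here splitResults [1..10] evaluates to (["1","2","3","4","5"],[4,10,16,22,28]), matching the lefts and rights output shown above.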

Haskell: Scan through a list and apply a different function for each element
