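What is the fastest way to parse a CSV file into a vector of vectors and then write it back out?

Asked: 2017-06-28 22:33:25

Tags: performance haskell

I'm trying to find the fastest way to reorder the columns of a CSV file (using a simple subset of CSV in which cells contain no commas). I do the reordering with Vector.backpermute, which works fine; according to profiling with +RTS -p, the bottleneck is building the vector of vectors that I apply it to. The code below is the fastest version I could come up with. Does anyone have any ideas?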

{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Applicative
import           Control.Monad
import qualified Data.ByteString            as B
import qualified Data.ByteString.Builder    as BB
import qualified Data.ByteString.Lazy       as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import           Data.Char
import           Data.Foldable
import           Data.Monoid
import qualified Data.Vector                as V
import           Data.Word
import           Debug.Trace
import           System.Environment
import           System.IO

data Args = Args { cols :: V.Vector Int, filePath :: FilePath } deriving (Show)

--
w8 = fromIntegral . ord
mconcat' :: (Foldable t, Monoid a) => t a -> a
mconcat' = foldl' (<>) mempty

parseArgs :: [String] -> Args
parseArgs [colStr, filePath] = Args ((\n -> n-1) . read <$> V.fromList (split ',' colStr)) filePath
  where split :: Char -> String -> [String]
        split d str = gosplit d str []
        gosplit d "" acc = reverse acc
        gosplit d str acc = gosplit d (drop 1 $ dropWhile (/= d) str) $ takeWhile (/= d) str : acc

reorder :: Args -> BL.ByteString -> BB.Builder
reorder (Args cols _ ) bstr =
  -- transform to vec matrix
  let rows = V.filter (not . BL.null) $ V.fromList $ BL.split (w8 '\n') bstr
      m = (V.fromList . BL.split (w8 ',')) <$> rows -- n^2
  -- reorder
      m' = (flip V.backpermute) cols <$> m
  -- build back to bytestring
      numRows = length m'
      numCols = length cols
      builderM = mconcat' . V.imap (\i v -> BB.lazyByteString v <> (if i < numCols - 1 then "," else "")) <$> m'
      builderM' = mconcat' . V.imap (\i v -> v <> (if i < numRows - 1 then "\n" else "")) $ builderM
  in builderM'

main :: IO ()
main = do
  args <- parseArgs <$> getArgs

  withFile (filePath args) ReadMode $ \h -> do
    csvData <- BL.hGetContents h
    BB.hPutBuilder stdout $ reorder args csvData

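The program is invoked like `$ reorder 2,1 x.csv`, which means "for every row of that CSV, output the second column followed by the first", so you can ignore the argument-parsing bit.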
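To make the reordering step concrete, here is a small sketch of what `V.backpermute` does to a single row. It assumes the `vector` package that the question already uses; `reorderCols` is a hypothetical wrapper introduced only for this illustration, and the sample row is made up.

```haskell
import qualified Data.Vector as V

-- Hypothetical wrapper (not from the question): select columns by
-- zero-based index. V.backpermute xs is yields xs ! i for each i in is.
reorderCols :: V.Vector Int -> V.Vector a -> V.Vector a
reorderCols cols row = V.backpermute row cols

main :: IO ()
main = do
  -- "2,1" on the command line becomes [1, 0] after subtracting 1.
  let cols = V.fromList [1, 0]
      row  = V.fromList ["a", "b", "c"] :: V.Vector String
  print (V.toList (reorderCols cols row))  -- prints ["b","a"]
```

Note that `V.backpermute` raises an error if an index is out of range, so a row with fewer cells than the largest requested column would fail here.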

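1 answer:

Answer 0 (score: 1)

I think you're working too hard. Building and transforming all of this data by hand is error-prone and hard to reason about (at least for me). cassava was made for this kind of task.

I couldn't fully untangle the data structure from the code you provided, so I'll use a simple example to show how to achieve the goal of "reordering columns like this".

Suppose you have a CSV describing a list of people and their ages.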

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

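import Data.ByteString.Lazy (ByteString)  -- cassava encodes to/decodes from lazy ByteStrings
import Data.Char (ord)
import Data.Csv
import Data.Text (Text)
import Data.Vector (Vector)
import GHC.Generics (Generic)

data Person = Person { name :: !Text , age :: !Int } deriving (Generic, Show)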

-- We want to read and write TSVs

decodeOpt :: DecodeOptions
decodeOpt = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

encodeOpt :: EncodeOptions
encodeOpt = defaultEncodeOptions { encDelimiter = fromIntegral (ord '\t') }

-- NB: Ideally, your encode and decode should be inverses, but these aren't
dec :: FromRecord a => HasHeader -> ByteString -> Either String (Vector a)
dec = decodeWith decodeOpt

enc :: ToRecord a => [a] -> ByteString
enc = encodeWith encodeOpt

Now for a bit of magic:

instance FromRecord Person

instance ToRecord Person where
  toRecord (Person name age) = record [ toField age, toField name ]

Now we can take

dec NoHeader "Roy\t30\r\nJim\t32" :: Either String (Vector Person)

and get

Right [Person {name = "Roy", age = 30}
      ,Person {name = "Jim", age = 32}]

and then re-serialize them

enc [Person "Roy" 30, Person "Jim" 32]

which gives us

"30\tRoy\r\n32\tJim\r\n"

That is all well and good if you are interested in index-based column manipulation. If your CSV has column names, you can be more direct about it.

instance ToNamedRecord Person
instance DefaultOrdered Person
instance FromNamedRecord Person

-- NB: Ideally, your encode and decode should be inverses, but these aren't
decName :: FromNamedRecord a => ByteString -> Either String (Header, Vector a)
decName = decodeByNameWith decodeOpt

encName :: ToNamedRecord a => [a] -> ByteString
encName = encodeByNameWith encodeOpt (header ["age", "name"])

Now we can do this

encName [Person "Roy" 30, Person "Jim" 32]

and get

"age\tname\r\n30\tRoy\r\n32\tJim\r\n"

decName "name\tage\r\nRoy\t30\r\nJim\t32" :: Either String (Header, Vector Person)

gives us

Right ( ["name","age"]
      , [Person { name = "Roy", age = 30 }
      , Person { name = "Jim", age = 32 }] )

Finally, if you really don't want any structure, cassava can handle that too.

dec NoHeader "Roy\t30\r\nJim\t32\r\n" :: Either String (Vector (Vector ByteString))

This gives us

Right [["Roy","30"],["Jim","32"]]

enc ([["Roy","30"],["Jim","32"]] :: [[ByteString]])

gives us

"Roy\t30\r\nJim\t32\r\n"

In this case they are just regular lists, so you can apply whatever transformation you like to the sublists to rearrange the columns.
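A minimal sketch of such a transformation, assuming each row is a plain list and the desired column order is given as zero-based indices. `reorderRow` and `reorderRows` are hypothetical helpers written for this illustration; they are not part of cassava.

```haskell
-- Hypothetical helper: pick the cells of one row in the requested order.
-- Uses (!!), so it is partial: an out-of-range index raises an error.
reorderRow :: [Int] -> [a] -> [a]
reorderRow cols row = map (row !!) cols

-- Apply the same column order to every row.
reorderRows :: [Int] -> [[a]] -> [[a]]
reorderRows cols = map (reorderRow cols)

main :: IO ()
main =
  -- Swap the two columns of the sample data used above.
  print (reorderRows [1, 0] [["Roy", "30"], ["Jim", "32"]])
  -- prints [["30","Roy"],["32","Jim"]]
```

List indexing is linear in the column index, so for wide rows or large files the vector-based `backpermute` from the question is likely the better fit; the list version is only meant to show the shape of the transformation that would sit between `dec` and `enc`.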