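I'm trying to find the fastest way to reorder the columns of a CSV file (using a simple subset of CSV with no commas inside cells). I do the reordering with Vector.backpermute, which works fine; according to +RTS -p, the bottleneck is building the vector of vectors that I run it on. The code below is the fastest version I could come up with. Does anyone have any ideas?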
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Applicative
import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Char
import Data.Foldable
import Data.Monoid
import qualified Data.Vector as V
import Data.Word
import Debug.Trace
import System.Environment
import System.IO
data Args = Args { cols :: V.Vector Int, filePath :: FilePath } deriving (Show)
-- Convert an ASCII character to its byte value.
w8 :: Char -> Word8
w8 = fromIntegral . ord
mconcat' :: (Foldable t, Monoid a) => t a -> a
mconcat' = foldl' (<>) mempty
parseArgs :: [String] -> Args
-- Column numbers are 1-based on the command line and 0-based internally.
parseArgs [colStr, filePath] = Args ((\n -> n-1) . read <$> V.fromList (split ',' colStr)) filePath
  where
    split :: Char -> String -> [String]
    split d str = gosplit d str []
    gosplit d "" acc = reverse acc
    gosplit d str acc = gosplit d (drop 1 $ dropWhile (/= d) str) $ takeWhile (/= d) str : acc
parseArgs _ = error "usage: reorder COLS FILE"
reorder :: Args -> BL.ByteString -> BB.Builder
reorder (Args cols _ ) bstr =
  -- transform to vec matrix
  let rows = V.filter (not . BL.null) $ V.fromList $ BL.split (w8 '\n') bstr
      m = (V.fromList . BL.split (w8 ',')) <$> rows -- n^2
      -- reorder
      m' = (flip V.backpermute) cols <$> m
      -- build back to bytestring
      numRows = length m'
      numCols = length cols
      builderM = mconcat' . V.imap (\i v -> BB.lazyByteString v <> (if i < numCols - 1 then "," else "")) <$> m'
      builderM' = mconcat' . V.imap (\i v -> v <> (if i < numRows - 1 then "\n" else "")) $ builderM
  in builderM'
main :: IO ()
main = do
  args <- parseArgs <$> getArgs
  withFile (filePath args) ReadMode $ \h -> do
    csvData <- BL.hGetContents h
    BB.hPutBuilder stdout $ reorder args csvData
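The program is invoked like this: $ reorder 2,1 x.csv
which means: for every row of the CSV, output the second column followed by the first. You can ignore the argument-parsing part.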
Answer 0 (score: 1)
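I think you're working too hard. Building and transforming all of this data by hand is error-prone and hard to reason about (at least for me). cassava was made for this kind of task.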
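I couldn't fully untangle the data structures in the code you provided, so I'll use a simple example to show how to achieve the goal of "reordering columns like this".
Suppose you have a CSV that lists people and their ages.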
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
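import Data.ByteString.Lazy (ByteString)
import Data.Char (ord)
import Data.Csv
import Data.Text (Text)
import Data.Vector (Vector)
import GHC.Generics (Generic)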
data Person = Person { name :: !Text , age :: !Int } deriving (Generic, Show)
-- We want to read and write TSVs
decodeOpt :: DecodeOptions
decodeOpt = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }
encodeOpt :: EncodeOptions
encodeOpt = defaultEncodeOptions { encDelimiter = fromIntegral (ord '\t') }
-- NB: Ideally, your encode and decode should be inverses, but these aren't
dec :: FromRecord a => HasHeader -> ByteString -> Either String (Vector a)
dec = decodeWith decodeOpt
enc :: ToRecord a => [a] -> ByteString
enc = encodeWith encodeOpt
Now for a bit of magic:
instance FromRecord Person
instance ToRecord Person where
  -- Emit age first, then name: this is where the columns get swapped.
  toRecord (Person name age) = record [ toField age, toField name ]
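For readers wondering what the empty FromRecord instance relies on, here is a minimal standalone sketch that spells out the decoding by hand instead of going through Generic. It uses the default comma delimiter rather than the tab options above, and swapColumns and example are hypothetical names introduced only for this illustration.

```haskell
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Csv (FromRecord (..), HasHeader (NoHeader), ToRecord (..),
                 decode, encode, record, toField, (.!))
import Data.Text (Text)
import qualified Data.Vector as V

data Person = Person { name :: !Text, age :: !Int } deriving (Eq, Show)

-- Hand-written decoder: expect exactly two fields, name then age.
instance FromRecord Person where
  parseRecord v
    | V.length v == 2 = Person <$> v .! 0 <*> v .! 1
    | otherwise       = mzero

-- Encoder that writes the fields in the opposite order.
instance ToRecord Person where
  toRecord (Person n a) = record [toField a, toField n]

-- Hypothetical helper: decode every row as a Person, then re-encode it.
swapColumns :: BL.ByteString -> Either String BL.ByteString
swapColumns input =
  encode . V.toList <$> (decode NoHeader input :: Either String (V.Vector Person))

-- Example input with the default comma delimiter.
example :: Either String BL.ByteString
example = swapColumns (BL8.pack "Roy,30\r\nJim,32\r\n")
```

The column order lives entirely in the two instances, so changing the output layout means editing toRecord only.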
Now we can take
dec NoHeader "Roy\t30\r\nJim\t32" :: Either String (Vector Person)
and get
Right [Person {name = "Roy", age = 30}
,Person {name = "Jim", age = 32}]
and then re-serialize them with
enc [Person "Roy" 30, Person "Jim" 32]
which gives us
"30\tRoy\r\n32\tJim\r\n"
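So this is all well and good, assuming you're interested in index-based column operations. If your CSV has column names, you can be more direct about things.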
instance ToNamedRecord Person
instance DefaultOrdered Person
instance FromNamedRecord Person
-- NB: Ideally, your encode and decode should be inverses, but these aren't
decName :: FromNamedRecord a => ByteString -> Either String (Header, Vector a)
decName = decodeByNameWith decodeOpt
encName :: ToNamedRecord a => [a] -> ByteString
encName = encodeByNameWith encodeOpt (header ["age", "name"])
Now we can do this
encName [Person "Roy" 30, Person "Jim" 32]
and get
"age\tname\r\n30\tRoy\r\n32\tJim\r\n"
or
decName "name\tage\r\nRoy\t30\r\nJim\t32" :: Either String (Header, Vector Person)
to get
Right ( ["name","age"]
, [Person { name = "Roy", age = 30 }
, Person { name = "Jim", age = 32 }] )
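If the column names are only known at run time, the same by-name idea can be sketched without a record type by decoding each row to cassava's NamedRecord (a map from column name to field). This is an illustration under assumptions rather than part of the approach above: reorderByName and example are hypothetical names, and the default comma delimiter is used.

```haskell
import qualified Data.ByteString.Char8 as BS8
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Csv (Header, NamedRecord, decodeByName, encodeByName, header)
import qualified Data.Vector as V

-- Hypothetical helper: re-emit a CSV that has a header row, keeping only the
-- named columns, in the order given. Columns not listed are dropped.
-- Caveat: encodeByName raises an error (lazily, while the output is being
-- produced) if a requested name is missing from the input header.
reorderByName :: [String] -> BL.ByteString -> Either String BL.ByteString
reorderByName wanted input = do
  (_, rows) <- decodeByName input :: Either String (Header, V.Vector NamedRecord)
  pure (encodeByName (header (map BS8.pack wanted)) (V.toList rows))

-- Example: put the age column in front of the name column.
example :: Either String BL.ByteString
example = reorderByName ["age", "name"] (BL8.pack "name,age\r\nRoy,30\r\nJim,32\r\n")
```

Because every row goes through a hash map, this is likely slower than the index-based route, so it is a convenience sketch rather than a performance suggestion.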
Finally, if you really don't want any structure, cassava can handle that too.
dec NoHeader "Roy\t30\r\nJim\t32\r\n" :: Either String (Vector (Vector ByteString))
This gives us
Right [["Roy","30"],["Jim","32"]]
and
enc [["Roy","30"],["Jim","32"]]
gives us
"Roy\t30\r\nJim\t32\r\n"
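In this case the rows are just plain sequences (vectors when decoding, ordinary lists when encoding), so you can apply whatever transformation you like to each row to rearrange the columns.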
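To tie this back to the question, the unstructured route can be combined with an index vector like the one parsed from the command line. The following is a minimal sketch, assuming 0-based column indices and the default comma delimiter; reorderColumns, pick and example are hypothetical names. It uses safe indexing instead of V.backpermute so that an out-of-range column yields a Left rather than an exception, and no claim is made that it outperforms the hand-rolled version.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Csv (HasHeader (NoHeader), decode, encode)
import qualified Data.Vector as V

-- Select the given 0-based columns from one row, failing if any is missing.
pick :: V.Vector Int -> V.Vector a -> Either String (V.Vector a)
pick cols row =
  maybe (Left "column index out of range") Right (traverse (row V.!?) cols)

-- Hypothetical helper: decode to raw fields, permute each row, re-encode.
reorderColumns :: V.Vector Int -> BL.ByteString -> Either String BL.ByteString
reorderColumns cols input = do
  rows  <- decode NoHeader input :: Either String (V.Vector (V.Vector BS.ByteString))
  rows' <- traverse (pick cols) rows
  pure (encode (V.toList rows'))

-- Example: second column first, then the first column.
example :: Either String BL.ByteString
example = reorderColumns (V.fromList [1, 0]) (BL8.pack "Roy,30\r\nJim,32\r\n")
```

If every row is known to have enough columns, V.backpermute could replace pick, at the cost of a runtime error on malformed input.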