Reading specific rows from a CSV with a header in Haskell

Time: 2016-08-21 19:03:33

Tags: csv haskell dataframe

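I'm learning Haskell and trying to write a simple program that produces a list of tuples

(header, data)

when reading a CSV. I'm trying to use Data.Text.Lazy and Data.Text.Lazy.IO because, as I understand it, they offer better performance and Unicode coverage than String.

The function I'm working on takes a row number (n) and a CSV file name (filename) and returns only the (header, datum) tuples for that row.

Here is my CSV, "dat.csv":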

ORDINAL,CATEGORICAL,BOOL,CONTINUOUS,INT
Low,Blue,True,1.2,2
Medium,Green,False,0.5,3
High, Green,False,1.0,5

Here is my code:

-- hs_reader.hs
{-# LANGUAGE OverloadedStrings #-}
import Data.Text.Lazy as T
import Data.Text.Lazy.IO as I
import Control.Applicative

getL :: Int -> FilePath -> IO [(Text,Text)]
getL n filename =
  do
    flines <- T.lines <$> I.readFile filename
    let headers = Prelude.head flines
    let body = Prelude.tail flines
    let row = Prelude.zip (splitOn "," headers) (splitOn "," (body !! n))
    return row

This works the way I want:

Prelude> :l hs_reader
[1 of 1] Compiling Main             ( hs_reader.hs, interpreted )
Ok, modules loaded: Main.
Prelude> getL 1 "dat.csv"
[("ORDINAL","Medium"),("CATEGORICAL","Green"),("BOOL","False"),("CONTINUOUS","0.5"),("INT","3")]
Prelude> getL 2 "dat.csv"
[("ORDINAL","High"),("CATEGORICAL"," Green"),("BOOL","False"),("CONTINUOUS","1.0"),("INT","5")]

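I realize I don't know much about how to use monads properly. I have 4 main questions:

Question (1): I'd like to partially apply the function and map it over a range of row numbers. Why doesn't this work?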

let readF x = getL x "dat.csv"
-- a
Prelude.map readF [1..3]
--
Prelude> Prelude.map readF [1..3]
--
<interactive>:514:1:
    No instance for (Show (IO [(Text, Text)]))
      arising from a use of ‘print’
    In a stmt of an interactive GHCi command: print it
--
-- b.
Prelude> T.map readF [1..3]
--
<interactive>:515:7:
    Couldn't match type ‘IO [(Text, Text)]’ with ‘Char’
    Expected type: Char -> Char
      Actual type: Int -> IO [(Text, Text)]
    In the first argument of ‘T.map’, namely ‘readF’
    In the expression: T.map readF [1 .. 3]
--
<interactive>:515:13:
    Couldn't match expected type ‘Text’ with actual type ‘[Integer]’
    In the second argument of ‘T.map’, namely ‘[1 .. 3]’
    In the expression: T.map readF [1 .. 3]
    In an equation for ‘it’: it = T.map readF [1 .. 3]

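Question (2): Is there a more elegant way to do this? Can I do it without any let statements, like the ones I have now?

Question (3): I tried the following because it looks more like the examples I've seen online. Why doesn't this work? (Can I not use "<-" in a where clause?)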

getL2 :: Int -> FilePath ->  [(Text,Text)]
getL2 n filename = do
  Prelude.zip (splitOn "," headers) (splitOn "," (body !! n))
  where 
    headers = Prelude.head flines
    body = Prelude.tail flines
    flines <- T.lines <$> I.readFile filename
--
-- ERROR!
hs_reader.hs:25:12:
    parse error on input ‘<-’
    Perhaps this statement should be within a 'do' block?
Failed, modules loaded: none.

Question (4): I'm working with monads. Would one of these apply here in a way that is easy to understand? >>= or >=> ?

1 answer:

Answer 0 (score: 2)

(1) You get a [IO [(Text,Text)]] because you mapped an Int -> IO [(Text,Text)] over an [Int]. You want mapM.
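To make the difference concrete, here is a minimal sketch. It assumes a cut-down two-column sample file (the name dat_sample.csv is hypothetical and written by the program itself), uses qualified imports instead of the question's unqualified ones, and indexes rows 0 to 2 because the sample has only three body rows.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Data.Text.Lazy (Text)

-- getL in the spirit of the question; an empty file yields an empty row.
getL :: Int -> FilePath -> IO [(Text, Text)]
getL n filename = do
  flines <- T.lines <$> I.readFile filename
  case flines of
    []             -> return []
    headers : body -> return (zip (T.splitOn "," headers) (T.splitOn "," (body !! n)))

main :: IO ()
main = do
  -- Hypothetical sample data, written here only to keep the sketch self-contained.
  I.writeFile "dat_sample.csv" "ORDINAL,INT\nLow,2\nMedium,3\nHigh,5\n"
  let readF x = getL x "dat_sample.csv"
  -- map  readF [0 .. 2] :: [IO [(Text, Text)]]   a list of actions; GHCi cannot show it
  -- mapM readF [0 .. 2] :: IO [[(Text, Text)]]   a single action that returns all rows
  rows <- mapM readF [0 .. 2]
  mapM_ print rows
```

T.map fails for a different reason: it maps a Char -> Char function over a Text, so it is not a list map at all.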

(2) !! is a smell. I would build the complete list right away; if you really want to supply Ints afterwards, you can still use !! at the call site:

flines <- T.lines <$> I.readFile filename
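The line above is only the first step. A fuller sketch of what point (2) describes might look like the following; the helper name pairRows and the sample file name are hypothetical, and qualified imports are assumed so that zip and map do not clash with Data.Text.Lazy.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Data.Text.Lazy (Text)

-- Pure part (hypothetical helper): pair every body row with the header row.
pairRows :: [Text] -> [[(Text, Text)]]
pairRows []               = []
pairRows (headers : body) = map (zip (T.splitOn "," headers) . T.splitOn ",") body

-- IO part: read the file once and return all rows.
readCSV :: FilePath -> IO [[(Text, Text)]]
readCSV filename = do
  flines <- T.lines <$> I.readFile filename
  return (pairRows flines)

main :: IO ()
main = do
  -- Hypothetical sample data so the sketch runs on its own.
  I.writeFile "dat_sample.csv" "ORDINAL,INT\nLow,2\nMedium,3\nHigh,5\n"
  rows <- readCSV "dat_sample.csv"
  print (rows !! 1)   -- the indexing now happens at the call site
```

With this split the file is read once, and the caller decides whether to index, map, or filter the rows.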

(3) <- is monadic bind; you can't just do that in a where clause. The only reason you can do it in a do block is that those lines are desugared into >>=.

(4) This is what that would look like with >>=:

readCSV :: FilePath -> IO [[(Text,Text)]]
readCSV filename = (T.lines <$> I.readFile filename) >>= \(headers : body) ->
  return $ Prelude.map (Prelude.zip (splitOn "," headers) . splitOn ",") body

Since filename is only used once, at the end of the first line, this can actually be written with >=>:

-- requires: import Control.Monad ((>=>))
readCSV :: FilePath -> IO [[(Text,Text)]]
readCSV = fmap T.lines . I.readFile >=> \(headers : body) ->
  return $ Prelude.map (Prelude.zip (splitOn "," headers) . splitOn ",") body

Since the last line only uses return, we don't even need a monadic bind here; fmap is enough:

readCSV :: FilePath -> IO [[(Text,Text)]]
readCSV = fmap ((\(headers : body) -> Prelude.map (Prelude.zip (splitOn "," headers) . splitOn ",") body) . T.lines) . I.readFile
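Applied back to getL2 from question (3), one possible repair is sketched below. Because the file is read with I.readFile, the result type has to become IO [(Text,Text)]; the <- line moves into the do block, and only pure definitions stay in the where clause. The names rowAt, pick and getL2' and the sample file are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Data.Text.Lazy (Text)

-- Pure helper (hypothetical): row n of the body, paired with the headers.
rowAt :: Int -> [Text] -> [(Text, Text)]
rowAt _ []               = []
rowAt n (headers : body) = zip (T.splitOn "," headers) (T.splitOn "," (body !! n))

-- do-notation version: <- is only legal inside the do block.
getL2 :: Int -> FilePath -> IO [(Text, Text)]
getL2 n filename = do
  flines <- T.lines <$> I.readFile filename
  return (pick flines)
  where
    pick = rowAt n   -- pure definitions are fine in a where clause

-- The same function with the do block desugared by hand into >>=.
getL2' :: Int -> FilePath -> IO [(Text, Text)]
getL2' n filename =
  (T.lines <$> I.readFile filename) >>= \flines -> return (rowAt n flines)

main :: IO ()
main = do
  -- Hypothetical sample data so the sketch runs on its own.
  I.writeFile "dat_sample.csv" "ORDINAL,INT\nLow,2\nMedium,3\nHigh,5\n"
  a <- getL2 1 "dat_sample.csv"
  b <- getL2' 1 "dat_sample.csv"
  print (a == b)   -- both versions should agree
```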

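Which of these is more readable is, of course, another question entirely.

Edit: one last suggestion, to refactor further:

parseCSV :: Text -> [[(Text, Text)]]
parseCSV =
  (\(headers : body) -> Prelude.map (Prelude.zip (splitOn "," headers) . splitOn ",") body)
  . T.lines

Then you use it like this:

-- requires: import System.Environment (getArgs)
main :: IO ()
main = do
  [filename, field] <- getArgs
  csv <- parseCSV <$> I.readFile filename
  -- getArgs yields Strings, so pack the field name into Text before the lookup
  print $ traverse (lookup (T.pack field)) csv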
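The traverse (lookup field) step may be the least obvious part, so here is a small sketch of what it computes on in-memory data. The helper name column and the sample text are hypothetical, and parseCSV is rewritten with an explicit empty-input case and qualified imports.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as T
import Data.Text.Lazy (Text)

-- Same idea as parseCSV above, with an explicit case for empty input.
parseCSV :: Text -> [[(Text, Text)]]
parseCSV input = case T.lines input of
  []             -> []
  headers : body -> map (zip (T.splitOn "," headers) . T.splitOn ",") body

-- Hypothetical helper: one column as Maybe [Text].
-- lookup gives a Maybe Text per row; traverse turns the list of Maybes
-- into a Maybe of a list, which is Nothing if any row lacks the field.
column :: Text -> [[(Text, Text)]] -> Maybe [Text]
column field = traverse (lookup field)

main :: IO ()
main = do
  let csv = parseCSV "ORDINAL,INT\nLow,2\nMedium,3\nHigh,5\n"
  print (column "INT" csv)      -- Just ["2","3","5"]
  print (column "MISSING" csv)  -- Nothing
```

Keeping parseCSV pure means it can be exercised like this without touching the file system; only main needs IO.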