Simple CSV import using concurrency

Time: 2019-05-15 17:10:31

Tags: haskell

I am trying to solve this academic use case: parse a CSV file and insert the data into a database using multiple threads.

In Java, I wrote a solution that reads the data into a collection on the main thread, then runs 8 tasks concurrently to insert it into the database. On my 8-core machine, it gets through a 1-million-row CSV file (2 columns: title and price) in about 60 seconds.

Then I tried to write the equivalent in Haskell with my beginner skills:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text
import qualified Data.Text.IO as TIO 
import Text.Parsec
import Text.Parsec.Text (Parser)
import Database.PostgreSQL.Simple
import Data.Int
import Control.Concurrent.Async

-- one CSV record: a title up to the first comma, then a numeric price
line :: Parser (Text,Text)
line = do
  title <- many $ noneOf ","
  oneOf ","
  price <- many digit
  return (pack title, pack price)

file :: Parser [(Text,Text)]
file = line `endBy` newline

parseCsv :: SourceName -> Text -> Either ParseError [(Text,Text)]
parseCsv = parse file

parseCsvF :: FilePath -> IO (Either ParseError [(Text,Text)])
parseCsvF path = fmap (parseCsv path) $ TIO.readFile path 

connectDB :: IO Connection
connectDB = connect (ConnectInfo { connectHost="localhost", connectPort=5432, connectUser="parser", connectPassword="parser", connectDatabase="parser"}) 

-- insert a single row; execute returns the number of affected rows
insertComic :: Connection -> (Text,Text) -> IO Int64
insertComic conn (title,price) = execute conn "INSERT INTO comics (title, price) VALUES (?,?)" [unpack title, unpack price]

main = do
  conn <- connectDB
  input <- parseCsvF "data.csv"
  let (Right x) = input
      inserts = Prelude.map (insertComic conn) x
      asyncs = Prelude.map async inserts
      waiters = Prelude.map waitForIt asyncs
  sequence waiters


-- start the async, then immediately block until it finishes
waitForIt :: IO (Async Int64) -> IO Int64
waitForIt x = x >>= \v -> wait v

ghc -threaded -rtsopts injector.hs -o injector
./injector +RTS -N8

Unfortunately, it is very slow (several minutes...).

I think I am not using Async correctly. Could someone show me an example solution so that this program uses multithreading efficiently?

2 Answers:

Answer 0 (score: 2):

I propose a solution here. It may not be the best one, but it achieves better performance than the Java code without using any explicit multithreading machinery. I use a resource pool and perform the inserts in chunks of 10,000 rows.

{-# LANGUAGE OverloadedStrings #-}

import Data.Text
import qualified Data.Text.IO as TIO 
import Text.Parsec
import Text.Parsec.Text (Parser)
import Database.PostgreSQL.Simple
import Data.Int
import Data.Pool
import qualified Data.List.Split as Split
import System.CPUTime

line :: Parser (Text,Text)
line = do
  title  <- many $ noneOf ","
  oneOf ","
  price <- many $ digit
  return (pack title,pack price)

file :: Parser [(Text,Text)]
file = line `endBy` newline

parseCsv :: SourceName -> Text -> Either ParseError [(Text,Text)]
parseCsv = parse file

parseCsvF :: FilePath -> IO (Either ParseError [(Text,Text)])
parseCsvF path = fmap (parseCsv path) $ TIO.readFile path 

connectionInfo :: ConnectInfo
connectionInfo = ConnectInfo {
  connectHost="localhost",
  connectPort=5432,
  connectUser="parser",
  connectPassword="parser",
  connectDatabase="parser"}

myPool :: IO (Pool Connection)
-- one stripe, 10 s idle timeout, at most 10 open connections
myPool = createPool (connect connectionInfo) close 1 10 10

insertComic :: Pool Connection -> [(Text , Text)] -> IO Int64
insertComic pool comic = withResource pool (\conn -> insertComic' conn comic)

insertComic' :: Connection -> [(Text,Text)] -> IO Int64 
insertComic' conn comics = executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" comics  

main = do
  start <- getCPUTime
  pool <- myPool
  input <- parseCsvF "data.csv"
  let (Right allComics) = input
      -- batch the rows into chunks of 10000, each inserted with a
      -- single multi-row INSERT via executeMany
      chunks = Split.chunksOf 10000 allComics
      inserts = [ insertComic pool chunk | chunk <- chunks]
  sequence inserts
  end <- getCPUTime
  -- getCPUTime is in picoseconds, hence the division by 10^12;
  -- note that it measures CPU time rather than wall-clock time
  putStrLn $ show $ fromIntegral (end-start) / 10^12

ghc -threaded -rtsopts injector.hs -o injector
./injector +RTS -N8
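
One caveat about the timing above: getCPUTime measures CPU time, which can badly understate the elapsed time of an IO-bound job like this. A minimal sketch of wall-clock timing instead (my addition, assuming the standard time package; the helper name timed is hypothetical):

import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- run an action and print how long it took in wall-clock time
timed :: IO a -> IO a
timed action = do
  start  <- getCurrentTime
  result <- action
  end    <- getCurrentTime
  putStrLn $ "elapsed: " ++ show (diffUTCTime end start)
  return result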

Answer 1 (score: 0):

The problem with your code is that it starts a separate database transaction for every single row.

I suggest you split the data into chunks and process each whole chunk in a single transaction.

It also helps to insert multiple records with a single INSERT statement, as sketched below.
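
To illustrate both points, here is a minimal sketch (my own, not code from either answer) that wraps each chunk in one transaction and issues one multi-row INSERT per chunk, using withTransaction and executeMany from postgresql-simple; the chunk size of 1000 is an arbitrary assumption, and the table and column names follow the question:

{-# LANGUAGE OverloadedStrings #-}

import Data.Int (Int64)
import Data.Text (Text)
import qualified Data.List.Split as Split
import Database.PostgreSQL.Simple

-- insert one chunk of rows inside a single transaction,
-- using one multi-row INSERT statement for the whole chunk
insertChunk :: Connection -> [(Text, Text)] -> IO Int64
insertChunk conn chunk =
  withTransaction conn $
    executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" chunk

insertAll :: Connection -> [(Text, Text)] -> IO ()
insertAll conn rows =
  mapM_ (insertChunk conn) (Split.chunksOf 1000 rows)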

编辑

Another problem (the biggest one) is that only a single connection is used, which effectively makes the code run sequentially rather than in parallel.
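
A minimal sketch (again my own, not from either answer) of how one might combine a connection pool like the one in answer 0 with mapConcurrently_ from the async package, so that chunks are inserted over several connections in parallel; the chunk size and pool size are assumptions:

import Data.Int (Int64)
import Data.Text (Text)
import Control.Concurrent.Async (mapConcurrently_)
import Data.Pool
import qualified Data.List.Split as Split
import Database.PostgreSQL.Simple

-- each chunk borrows its own connection from the pool; with a pool
-- of N connections, up to N chunk inserts can be in flight at once
insertChunkPooled :: Pool Connection -> [(Text, Text)] -> IO Int64
insertChunkPooled pool chunk =
  withResource pool $ \conn ->
    executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" chunk

insertAllConcurrently :: Pool Connection -> [(Text, Text)] -> IO ()
insertAllConcurrently pool rows =
  mapConcurrently_ (insertChunkPooled pool) (Split.chunksOf 10000 rows)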

Also, all the data is read into memory before it is processed. You could gain performance there as well.
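
On that last point, a minimal sketch (an assumption on my part, not code from the answer) of reading the file line by line instead of loading it whole, accumulating one chunk at a time and handing it to a sink action; each chunk of raw lines would still need to be parsed into (title, price) pairs, e.g. with the question's line parser, before being passed to one of the insert functions above:

import Data.Text (Text)
import qualified Data.Text.IO as TIO
import System.IO

-- stream the file in chunks of n lines, applying `sink` to each
-- chunk, so only one chunk is held in memory at a time
forEachChunk :: Int -> FilePath -> ([Text] -> IO ()) -> IO ()
forEachChunk n path sink = withFile path ReadMode (go 0 [])
  where
    go k acc h = do
      eof <- hIsEOF h
      if eof
        then flush acc
        else do
          l <- TIO.hGetLine h
          if k + 1 == n
            then flush (l : acc) >> go 0 [] h
            else go (k + 1) (l : acc) h
    flush [] = return ()
    flush acc = sink (reverse acc)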