I'm trying to solve this academic exercise: parse a CSV file and insert the data into a database using multiple threads.
In Java I wrote a solution that reads the data into a collection on the main thread and then runs 8 tasks concurrently to insert it into the database. On my 8-core machine it handles a 1-million-row CSV file (2 columns: title and price) in about 60 seconds.
Then I tried to write the equivalent in Haskell with my beginner-level skills:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import qualified Data.Text.IO as TIO
import Text.Parsec
import Text.Parsec.Text (Parser)
import Database.PostgreSQL.Simple
import Data.Int
import Control.Concurrent.Async
line :: Parser (Text, Text)
line = do
  title <- many $ noneOf ","
  oneOf ","
  price <- many digit
  return (pack title, pack price)
file :: Parser [(Text,Text)]
file = line `endBy` newline
parseCsv :: SourceName -> Text -> Either ParseError [(Text,Text)]
parseCsv = parse file
parseCsvF :: FilePath -> IO (Either ParseError [(Text,Text)])
parseCsvF path = fmap (parseCsv path) $ TIO.readFile path
connectDB :: IO Connection
connectDB = connect (ConnectInfo { connectHost="localhost", connectPort=5432, connectUser="parser", connectPassword="parser", connectDatabase="parser"})
insertComic :: Connection -> (Text,Text) -> IO Int64
insertComic conn (title,price) = execute conn "INSERT INTO comics (title, price) VALUES (?,?)" [unpack title , unpack price]
main = do
  conn <- connectDB
  input <- parseCsvF "data.csv"
  let (Right x) = input
      inserts = Prelude.map (insertComic conn) x
      asyncs  = Prelude.map async inserts
      waiters = Prelude.map waitForIt asyncs
  sequence waiters

waitForIt :: IO (Async Int64) -> IO Int64
waitForIt x = x >>= \v -> wait v
ghc -threaded -rtsopts injector.hs -o injector
./injector +RTS -N8
Unfortunately, it is very slow (several minutes...).
I think I'm not using Async correctly. Could someone show me an example of how to make this program use multithreading efficiently?
Answer 0 (score: 2)
I propose a solution here. It may not be the best one, but it easily achieves better performance than the Java code without using any explicit multithreading mechanism: I use a resource pool and insert in chunks of 10000 rows.
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import qualified Data.Text.IO as TIO
import Text.Parsec
import Text.Parsec.Text (Parser)
import Database.PostgreSQL.Simple
import Data.Int
import Control.Concurrent.Async
import Data.Pool
import qualified Data.List.Split as Split
import System.CPUTime
line :: Parser (Text, Text)
line = do
  title <- many $ noneOf ","
  oneOf ","
  price <- many digit
  return (pack title, pack price)
file :: Parser [(Text,Text)]
file = line `endBy` newline
parseCsv :: SourceName -> Text -> Either ParseError [(Text,Text)]
parseCsv = parse file
parseCsvF :: FilePath -> IO (Either ParseError [(Text,Text)])
parseCsvF path = fmap (parseCsv path) $ TIO.readFile path
connectionInfo :: ConnectInfo
connectionInfo = ConnectInfo
  { connectHost     = "localhost"
  , connectPort     = 5432
  , connectUser     = "parser"
  , connectPassword = "parser"
  , connectDatabase = "parser"
  }

-- 1 stripe, 10 s idle timeout, at most 10 connections per stripe
myPool :: IO (Pool Connection)
myPool = createPool (connect connectionInfo) close 1 10 10

insertComic :: Pool Connection -> [(Text, Text)] -> IO Int64
insertComic pool comics = withResource pool (\conn -> insertComic' conn comics)

insertComic' :: Connection -> [(Text, Text)] -> IO Int64
insertComic' conn comics = executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" comics
main = do
  start <- getCPUTime
  pool <- myPool
  input <- parseCsvF "data.csv"
  let (Right allComics) = input
      chunks  = Split.chunksOf 10000 allComics
      inserts = [insertComic pool chunk | chunk <- chunks]
  sequence inserts
  end <- getCPUTime
  -- getCPUTime is in picoseconds, and measures CPU time rather than wall-clock time
  putStrLn $ show $ fromIntegral (end - start) / 10^12
ghc -threaded -rtsopts injector.hs -o injector
./injector +RTS -N8
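A note on why this is fast even without explicit threads: executeMany turns each 10000-row chunk into a single multi-row INSERT ... VALUES statement, so loading a million rows costs on the order of a hundred round-trips to the database instead of a million.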
Answer 1 (score: 0)
The problem with your code is that it starts a database transaction for every single row.
I suggest splitting the data into chunks and processing each whole chunk in a single transaction.
It also helps if you insert multiple records with a single INSERT statement.
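For illustration, a minimal sketch of both suggestions, reusing the question's table and the imports from the other answer (Database.PostgreSQL.Simple, which also exports withTransaction, and Data.List.Split); the helper names and the chunk size are my own:

-- Sketch: one transaction and one multi-row INSERT per 10000-row chunk.
insertChunked :: Connection -> [(Text, Text)] -> IO ()
insertChunked conn comics = mapM_ insertChunk (Split.chunksOf 10000 comics)
  where
    insertChunk chunk = withTransaction conn $ do
      _ <- executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" chunk
      return ()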
Edit
Another problem (the biggest one) is that you use only a single connection, which effectively makes the code sequential rather than parallel.
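As a minimal sketch of the fix, combining the connection pool from the other answer with mapConcurrently_ from Control.Concurrent.Async so that chunks are inserted in parallel over separate connections (the helper names are my own):

-- Sketch: every concurrent insert borrows its own connection from the pool,
-- so up to 10 chunks (the pool's maximum) run in parallel.
insertParallel :: Pool Connection -> [[(Text, Text)]] -> IO ()
insertParallel pool chunks = mapConcurrently_ insertChunk chunks
  where
    insertChunk chunk = withResource pool $ \conn ->
      executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" chunk

mapConcurrently_ waits for every chunk to finish before returning, and the pool caps the number of simultaneous connections, so the database is not flooded with one task per row.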
You also read all of the data into memory before processing it; you could gain some performance there as well.
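As a rough sketch of that idea, lazy I/O lets rows be parsed and inserted while the file is still being read, so the full list never has to live in memory at once; the naive comma split below is only a stand-in for the Parsec parser and ignores quoting:

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Sketch: chunks are built and inserted on demand as the file is consumed.
streamInserts :: Pool Connection -> FilePath -> IO ()
streamInserts pool path = do
  contents <- TLIO.readFile path
  let parseLine l =
        let (t, rest) = TL.break (== ',') l
        in (TL.toStrict t, TL.toStrict (TL.drop 1 rest))
      rows = map parseLine (TL.lines contents)
  mapM_ insertChunk (Split.chunksOf 10000 rows)
  where
    insertChunk chunk = withResource pool $ \conn ->
      executeMany conn "INSERT INTO comics (title, price) VALUES (?,?)" chunk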