I have the following program in Haskell:
processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
  where
    decodedResponse :: Maybe GeocodingResponse
    decodedResponse = decodeGeocodingResponse inputResponse
    latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse
It reads lists of properties (houses and apartments) from html files, parses them, geocodes the addresses and saves the results into a sqlite db.
Everything works fine except for a very high memory usage (around 800M).
By commenting out parts of the code I have narrowed the problem down to the geocoding step.
I send 100 addresses at a time to the MapQuest API (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is fairly large, so it could be one of the culprits, but 800M? I feel that it holds on to all the results until the end, which drives memory usage that high.
After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.
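To illustrate that suspicion with a minimal, self-contained sketch (fakeRequest is just a hypothetical stand-in for the MapQuest call, not code from the program): mapM in IO produces the whole list of batch responses before anything downstream can consume them.

-- Hypothetical illustration only: every "response" is alive at once,
-- because mapM builds the complete result list before returning.
import Data.List.Split (chunksOf)

main :: IO ()
main = do
    let batches = chunksOf 100 [1 .. 10000 :: Int]
    responses <- mapM fakeRequest batches  -- all batch results are retained here
    print (length (concat responses))      -- nothing is consumed until this point
  where
    fakeRequest :: [Int] -> IO [Int]
    fakeRequest = pure . map (* 2)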
You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so
I'm quite new to Haskell, so I'm not sure how to optimise it. Any ideas?
Cheers!
Answer (score: 1):
It might be worth recording that this is the simple streaming problem that arises with mapM and sequence, and with replicateM, traverse and the other things that let you "extract a list from IO": these always raise accumulation worries. So a little detour through a streaming library is needed. In the repo, you just need to replace
processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations
with something like this:
import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date          -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding   -- this was the main pain point, see below
            $ S.filter hasPrice
            $ S.concat
            $ S.mapM fileToProperties   -- this mapM doesn't accumulate
            $ S.each allFiles           -- the list is converted to a stream

propertiesWithGeocoding
    :: Stream (Of ParsedProperty) IO r
    -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =
    S.concat $ S.concat
             $ S.mapM geocodeAddresses  -- this mapM doesn't accumulate results from MapQuest
             $ S.mapped S.toList        -- convert segments to Haskell lists
             $ chunksOf 100 properties  -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of Maybe as into a stream of as
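For the pipeline above to typecheck, the geocodeAddresses from the question presumably has to be adjusted so that it takes the properties themselves and pairs each one with its result. A sketch of that assumed shape, reusing the question's getEnv, openURL, mapQuestUrl, location and geocodeResponseToResults unchanged:

-- Assumed adjustment (not shown above): geocodeAddresses now pairs each
-- ParsedProperty with its geocoding result, so the two S.concat calls can
-- flatten the Maybe and the list into (ParsedProperty, Maybe LatLng) pairs.
geocodeAddresses :: [ParsedProperty] -> IO (Maybe [(ParsedProperty, Maybe LatLng)])
geocodeAddresses properties = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey (fmap location properties)
    return $ zip properties <$> geocodeResponseToResults geocodeResponse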
Memory usage then looks like a series of spikes: each peak corresponds to a trip to MapQuest, followed by a little processing and printing, after which ghc forgets all about it and moves on.
Of course this could also be done with pipes or conduit, but here all we need is a little simple mapM / sequence / traverse / replicateM avoidance, and streaming is probably the simplest for this kind of quick local refactoring. Note that the list here is quite short, so the thought "a short list is fine with mapM / traverse / etc.!" can be quite deceptive. Why not just get rid of them? Whenever you are about to write mapM f over a list, it is worth considering S.mapM f . S.each (or the conduit or pipes equivalent). You will then have a stream, and can recover a list with S.toList or the equivalent, but most likely you will find, as in this case, that you don't need a reified accumulated list at all, but can keep streaming (printing, writing to a database, and so on) after making whatever transformations you need; here, for example, the streaming filter and concat are used to flatten the streamed lists and to act as a kind of catMaybes.
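As a tiny sketch of that habit (lookupThing is a hypothetical stand-in for any per-item IO action, not anything from the repo):

-- Hypothetical example of replacing a list mapM with S.mapM f . S.each.
import qualified Streaming.Prelude as S

demo :: [String] -> IO ()
demo names =
    S.print                  -- each result is printed as soon as it is produced
        $ S.mapM lookupThing -- effects are interleaved with consumption
        $ S.each names       -- the plain list becomes a stream
  where
    lookupThing :: String -> IO Int
    lookupThing = pure . length

-- If a list really is needed at the end, it can still be recovered;
-- only then does anything accumulate.
demoList :: [String] -> IO [Int]
demoList names = S.toList_ (S.mapM (pure . length) (S.each names))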