Reducing the memory usage of a Haskell program

Date: 2017-01-28 00:29:10

Tags: haskell streaming aeson

I have the following program in Haskell:

processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
    where
        decodedResponse :: Maybe GeocodingResponse
        decodedResponse = decodeGeocodingResponse inputResponse

        latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse  

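It reads a list of properties (houses and apartments) from html files, parses them, geocodes the addresses and saves the results into a sqlite db.
Everything works fine except for very high memory usage (around 800M). By commenting out code I have pinpointed the problem to the geocoding step. I send 100 addresses at a time to the MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/). The response for 100 addresses is quite large, so it might be one of the culprits, but 800M? I suspect it holds on to all of the results until the very end, which drives memory usage this high.

After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.

You can get the full version to reproduce the issue here: https://github.com/Leonti/haskell-memory-so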

I'm new to Haskell, so I'm not sure how to optimize this. Any ideas?

Cheers!

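1 answer:

Answer 0: (score: 1)

It is worth recording that this turned out to be a simple streaming problem arising from the use of mapM and sequence, which, together with replicateM, traverse and other things that let you "extract a list from IO", always raise accumulation worries. So a small detour through a streaming library was needed. In the repo, it was only necessary to replace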

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations

with something like this

import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date   -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding -- this was the main pain point see below
            $ S.filter hasPrice 
            $ S.concat 
            $ S.mapM fileToProperties -- this mapM doesn't accumulate
            $ S.each allFiles    -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
     -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =  
    S.concat $ S.concat 
             $ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
             $ S.mapped S.toList       -- convert segments to haskell lists
             $ chunksOf 100 properties -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of maybe as into a stream of as

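Memory usage then shows a series of peaks, each corresponding to one trip to Mapquest, promptly followed by a little processing and printing, after which ghc forgets all about it and moves on.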
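
To see the batching pattern in isolation, here is a minimal sketch that needs no network access. It assumes the streaming package is installed; fakeGeocode and geocodeAll are hypothetical names, and fakeGeocode merely stands in for the real geocodeAddresses by returning the same nested Maybe shape.

```haskell
import Streaming (Stream, Of, chunksOf)
import qualified Streaming.Prelude as S

-- Hypothetical stand-in for the MapQuest call (an assumption, not the real API):
-- it logs the batch size and "geocodes" each address to its length.
fakeGeocode :: [String] -> IO (Maybe [Maybe Int])
fakeGeocode batch = do
    putStrLn ("request with " ++ show (length batch) ++ " addresses")
    return (Just (map (Just . length) batch))

-- Same pipeline shape as propertiesWithGeocoding, with a batch size of 3.
geocodeAll :: Stream (Of String) IO r -> Stream (Of (Maybe Int)) IO r
geocodeAll addresses =
    S.concat $ S.concat               -- flatten the list, after flattening the outer Maybe
             $ S.mapM fakeGeocode     -- one effectful call per batch
             $ S.mapped S.toList      -- turn each streamed chunk into a plain list
             $ chunksOf 3 addresses   -- the streaming chunksOf

main :: IO ()
main = S.print $ geocodeAll $ S.each (map show [1 .. 7 :: Int])
```

When run, the "request with ..." lines should interleave with the printed results rather than all appearing first, which is the behaviour behind the separate peaks: only one batch needs to be alive at a time.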

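Of course this could be done with pipes or conduit. But here we just need a little simple mapM / sequence / traverse / replicateM avoidance, and streaming is perhaps the simplest choice for this sort of quick local refactoring. Note that this list is quite short, so the thought "but short lists are fine with mapM / traverse / etc!" can be quite spectacularly false. Why not just get rid of them? Whenever you are about to write mapM f on a list, it is a good idea to consider S.mapM f . S.each (or the conduit or pipes equivalent). You will then have a stream and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a fully materialized, accumulated list at all, and can instead consume the stream directly after whatever list-like manipulations are needed (here, for example, we use the streaming filter, and concat both to flatten streamed lists and as a sort of catMaybes).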
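
As a rough illustration of that substitution, the sketch below puts the list version and the streaming version side by side. It assumes the streaming package is installed; fetch is a hypothetical effectful step standing in for something like fileToProperties, and the other function names are made up for this example.

```haskell
import qualified Streaming.Prelude as S

-- Hypothetical effectful step (an assumption for illustration only).
fetch :: Int -> IO String
fetch n = return ("item " ++ show n)

-- List version: mapM builds the whole result list before anything is printed.
accumulating :: [Int] -> IO ()
accumulating xs = do
    results <- mapM fetch xs
    mapM_ putStrLn results

-- Streaming version: each result is produced and printed before the next
-- one is fetched, so nothing forces the results to pile up.
streamed :: [Int] -> IO ()
streamed = S.mapM_ putStrLn . S.mapM fetch . S.each

-- If a list really is needed at the end, it can still be recovered.
recovered :: [Int] -> IO [String]
recovered = S.toList_ . S.mapM fetch . S.each

main :: IO ()
main = streamed [1 .. 5]
```

Both accumulating and streamed print the same lines; the difference is only in when each result can be garbage collected.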