Question

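This is my first question on SO, so please bear with me. I have searched here and there but could not find a solution to my problem, even though there are many topics about multiprocessing and memory. My program uses sqlalchemy objects that are linked to other objects in the database (articles have entities and so on). In one step I need to score every pair of articles, and this part has a huge computational cost. I want to use multiprocessing to speed it up (1000 articles take 210 minutes to finish). The original single-process version consumes about 750 MB of memory and stays at that amount. After splitting the work across 4 processes, the memory of each process grows over time and reaches more than 1.5 GB. What can cause such memory consumption? I tried spawning the processes before loading the articles from the DB, and also passing less data to the mapped function, but with little effect. None of the data is modified inside the function.

The code in question: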

pool = multiprocessing.Pool()
# loading articles into self.articles, this becomes dictionary with article ids as keys
# doing some more computing, not important AFAIK
scores = pool.map(articlePairScoreWrapper, [(article1,  # instance of sqlalchemy mapped class containing references to another DB records
    article2, 
    article1.rssRecord.date,  # datetime object
    article2.rssRecord.date, 
    self.dateThreshold,  # small integer
    self.tfIdfCorpus[self.dbGensimMapper[article1.id]],  # this is quite short dictionary
    self.tfIdfCorpus[self.dbGensimMapper[article2.id]], 
    self.quotationsArt[article1.id],  # short list of strings, often empty
    self.quotationsArt[article2.id], 
    self.weights  # dictionary with 3 small items
    ) for article1, article2 in itertools.combinations(self.articles.values(), 2)])

articlePairScoreWrapper just returns the value of articlePairScore(*args).
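For illustration only, here is a minimal sketch of one way to keep the per-task payload small. The question does not confirm that the payload is the cause of the memory growth. The sketch assumes the scoring only needs plain, picklable data keyed by article id, so the mapped sqlalchemy instances never have to be sent to the workers. The names `init_worker`, `score_pair`, `score_all_pairs`, and `ARTICLE_DATA` are hypothetical, and the scoring formula is a placeholder, not the real `articlePairScore`. The lookup table is handed to each worker once through the pool initializer, and the pairs are fed lazily with `imap` instead of building the full argument list first.

```python
import itertools
import multiprocessing

# Hypothetical per-worker lookup table: article id -> small term dictionary.
# In the question this would be built from self.tfIdfCorpus and similar data.
ARTICLE_DATA = {}


def init_worker(article_data):
    """Store the lightweight lookup table once in each worker process."""
    global ARTICLE_DATA
    ARTICLE_DATA = article_data


def score_pair(id_pair):
    """Placeholder score: weighted overlap of the two term dictionaries."""
    id1, id2 = id_pair
    terms1 = ARTICLE_DATA[id1]
    terms2 = ARTICLE_DATA[id2]
    shared = set(terms1) & set(terms2)
    return id1, id2, sum(terms1[t] * terms2[t] for t in shared)


def score_all_pairs(article_data, processes=2, chunksize=256):
    """Score every pair of ids, sending only id tuples to the workers."""
    # combinations() is a generator, so the pair list is not built up front.
    pairs = itertools.combinations(sorted(article_data), 2)
    with multiprocessing.Pool(
        processes, initializer=init_worker, initargs=(article_data,)
    ) as pool:
        return list(pool.imap(score_pair, pairs, chunksize=chunksize))


if __name__ == "__main__":
    data = {1: {"a": 1.0, "b": 2.0}, 2: {"b": 3.0}, 3: {"c": 1.0}}
    for result in score_all_pairs(data):
        print(result)
```

With 1000 articles there are 499,500 pairs, so anything pickled per pair is multiplied by that count. Sending two integers per task keeps that cost small, whereas each tuple in the original call carries two mapped objects plus their associated data.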

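I hope I have not forgotten anything and have not been too verbose. Thank you for your replies.

Edit:

Output of ps -l -y while the program was running. I had to kill it almost immediately because it was swapping too much: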

S   UID   PID  PPID  C PRI  NI   RSS    SZ WCHAN  TTY          TIME CMD
S  1000 16723 11247 59  80   0 670736 322767 futex_ pts/13 00:01:48 Article_cluster
S  1000 16734 16723  5  80   0 1414648 396561 futex_ pts/13 00:00:09 Article_cluster
S  1000 16735 16723  5  80   0 1310716 377368 futex_ pts/13 00:00:09 Article_cluster
S  1000 16736 16723  5  80   0 1327292 374276 futex_ pts/13 00:00:09 Article_cluster
D  1000 16737 16723  2  80   0 965584 287491 sleep_ pts/13 00:00:05 Article_cluster

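Nevertheless, I managed to optimize the original function, and it now works well enough without multiprocessing, so this problem no longer needs solving.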

Memory usage grows rapidly when using multiprocessing and sqlalchemy objects

0 answers: