Gensim LdaMulticore is not multiprocessing properly (using just 4 workers)

Time: 2016-02-12 22:01:00

Tags: python lda gensim topic-modeling

I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each).

I set the workers argument to 20, but top shows only 4 processes in use. Some discussions suggest the cause may be slow corpus reading, for example: "gensim LdaMulticore not multiprocessing?" and https://github.com/piskvorky/gensim/issues/288

But both of those cases use MmCorpus, whereas my corpus is held entirely in memory. My machine has a very large amount of RAM (250 GB), and loading the corpus into memory takes around 40 GB. Even so, LdaMulticore uses just 4 processes. I created the corpus as:


I cannot work out what the limiting factor is here.

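1 answer:

Answer 0 (score: 0)

I would check the batch size you are using.

I have found that if batch size x n_workers is larger than the number of documents, I cannot make use of all the workers I have available. This makes sense, because on each pass you hand every worker a batch of documents. If you do not take the batch size into account, you may "starve" some of them.

I am not sure this will solve your specific problem, but it has indeed been the cause for many people who report that multicore does not "work" as expected.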
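The arithmetic behind this can be sketched in a few lines of plain Python. This is only an illustration of the rule of thumb above, not gensim's actual scheduling logic: the helper names are hypothetical, and the assumption is that the "batch size" corresponds to LdaMulticore's `chunksize` parameter, with each chunk keeping at most one worker busy at a time.

```python
import math


def chunks_per_pass(num_docs: int, chunksize: int) -> int:
    """Number of chunks one pass over the corpus is split into."""
    return math.ceil(num_docs / chunksize)


def max_busy_workers(num_docs: int, chunksize: int, workers: int) -> int:
    """Upper bound on workers that can hold a chunk at the same time.

    Assumption: one chunk occupies one worker, so there can never be
    more busy workers than there are chunks.
    """
    return min(workers, chunks_per_pass(num_docs, chunksize))


def largest_chunksize_for(num_docs: int, workers: int) -> int:
    """Largest chunksize that still yields at least `workers` chunks."""
    return max(1, num_docs // workers)


if __name__ == "__main__":
    # Small corpus: 10,000 docs in chunks of 2,000 gives only 5 chunks,
    # so at most 5 of the 20 requested workers can be fed.
    print(max_busy_workers(10_000, 2_000, 20))

    # Shrinking the chunks to 500 docs gives 20 chunks, one per worker.
    print(largest_chunksize_for(10_000, 20))

    # The corpus size from the question, with a chunksize of 2,000.
    print(max_busy_workers(28_000_000, 2_000, 20))
```

With 28M documents and a chunksize of 2,000 there are far more chunks than workers, so by this estimate the batch size alone would not explain seeing only 4 processes. The values checked here are the ones that would be passed as the `workers` and `chunksize` arguments to `LdaMulticore`.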