I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each).
I set the workers argument to 20, but top shows only 4 processes in use. There are some discussions suggesting that reading the corpus may be the bottleneck, such as "gensim LdaMulticore not multiprocessing?" and https://github.com/piskvorky/gensim/issues/288
But both of those use MmCorpus, whereas my corpus is held entirely in memory. My machine has a very large amount of RAM (250 GB), and loading the corpus into memory takes around 40 GB. Even so, LdaMulticore uses just 4 processes. I created the corpus as:
public SkillDTO(Skill skill) {
    idSkill = skill.getIdSkill();
    name = skill.getName();
    levelBezeichnung = skill.getLevelBezeichnung().getBezeichnung();
    checked = skill.isChecked();
    if (skill.getSkills().size() > 0) {
        Iterator<Skill> iteratorSkill = skill.getSkills().iterator();
        while (iteratorSkill.hasNext()) {
            Skill tempSkill = iteratorSkill.next();
            skills.add(convertSkillsToProfileDTO(tempSkill));
        }
    }
}

private SkillDTO convertSkillsToProfileDTO(Skill skill) {
    return new SkillDTO(skill);
}
I cannot figure out what the limiting factor is here.
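Answer 0 (score: 0)
I would check the batch size you are using.
I have found that if batch size × n_workers is larger than the number of documents, I cannot make use of all the workers I have available. This makes sense, because on every pass each worker is handed a batch of documents. If you do not take the batch size into account, you can end up "starving" some of the workers.
I am not sure whether this solves your specific problem, but it is the reason many people report that multicore does not "work" as expected.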
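The batch-size argument above can be sketched as a quick calculation. This is only a rough model, assuming each worker processes one batch at a time and that the batch size corresponds to Gensim's chunksize parameter; the helper name busy_workers is hypothetical and not part of Gensim.

```python
import math


def busy_workers(num_docs: int, chunksize: int, workers: int) -> int:
    """Rough upper bound on how many workers can hold a batch at once.

    Assumption: the corpus is split into ceil(num_docs / chunksize)
    batches and each worker processes one batch at a time.
    """
    num_chunks = math.ceil(num_docs / chunksize)
    return min(workers, num_chunks)


# Small corpus: only 5 batches exist, so 15 of the 20 workers stay idle.
print(busy_workers(num_docs=10_000, chunksize=2_000, workers=20))  # 5

# Corpus size from the question with a batch size of 2000.
print(busy_workers(num_docs=28_000_000, chunksize=2_000, workers=20))  # 20
```

By this estimate, 28M documents with a batch size of 2000 give far more batches than workers, so starvation would only be expected here if a much larger batch size were in use.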