I am running a Latent Dirichlet Allocation (LDA) topic model in R with the following code:

library(topicmodels)
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003, 10, 100, 10005, 765)
nstart <- 5
best <- TRUE
for (k in 2:30) {
  ldaOut <- LDA(dtm, k, method = "Gibbs",
                control = list(nstart = nstart, seed = seed, best = best,
                               burnin = burnin, iter = iter, thin = thin))
  assign(paste("ldaOut", k, sep = "_"), ldaOut)
}
The dtm has 12 million elements, and each iteration of the loop takes about two hours on average. Meanwhile, R uses only 1 of my 8 logical processors (I have an i7-2700K CPU @ 3.50GHz with 4 cores). How can I make R use all of the available computing power when I run an LDA topic model or use a loop, as in this code?
Thanks
EDIT: following gc_'s suggestion, I used the code below. It runs without errors, but there are now 16 "R for Windows front-end" processes, 15 of which use 0% of the CPU and one uses 16-17%... and I receive this message when the process ends. The code I used:
library(doParallel)
n.cores <- detectCores(all.tests = T, logical = T)
cl <- makePSOCKcluster(n.cores)
doParallel::registerDoParallel(cl)
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,10,100,10005,765)
nstart <- 5
best <- TRUE
var.shared <- c("ldaOut", "dtm", "nstart", "seed", "best", "burnin", "iter", "thin", "n.cores")
library.shared <- "topicmodels" # Same for library or functions.
ldaOut <- c()
foreach (k = 2:(30 / n.cores - 1), .export = var.shared, .packages = library.shared) %dopar% {
ret <- LDA(dtm, k*n.cores , method="Gibbs",
control=list(nstart=nstart, seed = seed, best=best,
burnin = burnin, iter = iter, thin=thin))
assign(paste("ldaOut", k*n.cores, sep = "_"), ret)
}
Answer 0 (score: 1)
You can use the doParallel library:
library(doParallel)
To get the number of cores of your machine:
n.cores <- detectCores(all.tests = T, logical = T)
Note the difference between logical cores and physical cores here.
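For example (a minimal sketch; the actual counts depend on your machine):

```r
library(parallel)

# Logical processors include hyper-threads: e.g. 8 on an i7-2700K.
n.logical  <- detectCores(logical = TRUE)
# Physical cores only: e.g. 4 on the same CPU (may be NA on some platforms).
n.physical <- detectCores(logical = FALSE)
```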
Now you need to allocate the cores and set up the worker processes:
cl <- makePSOCKcluster(n.cores)
doParallel::registerDoParallel(cl)
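Once the cluster is registered, a toy loop is an easy way to check that foreach and the workers are wired up correctly (a sketch with a fixed 2-worker cluster):

```r
library(doParallel)  # also attaches foreach

cl <- makePSOCKcluster(2)  # two worker processes
registerDoParallel(cl)

# each worker computes some squares; .combine = c flattens the results
res <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)
res  # 1 4 9 16
```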
The number of processes you create can exceed the number of cores on your machine. Because R starts new processes for the workers, you need to declare the libraries and variables that are shared with them.
var.shared <- c("ldaOut", "dtm", "nstart", "seed", "best", "burnin", "iter", "thin", "n.cores")
library.shared <- "topicmodels" # likewise for libraries or functions the workers need
The loop then becomes:
ldaOut <- list() # initialise the output
foreach (k = 2:(30 / n.cores - 1), .export = var.shared, .packages = library.shared) %dopar% {
  ret <- LDA(dtm, k*n.cores , method="Gibbs",
             control=list(nstart=nstart, seed = seed, best=best,
                          burnin = burnin, iter = iter, thin=thin))
  assign(paste("ldaOut", k*n.cores, sep = "_"), ret)
}
I have never used LDA before, so you may need to modify the code above to make it work.
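One caveat with the snippet above: assign() runs inside the worker processes, so those ldaOut_k variables never appear in your own session. The usual foreach pattern is to return the fitted model from the block and let foreach collect the results into a list. A sketch under stated assumptions: a tiny synthetic dtm and shortened burnin/iter stand in for the real ones.

```r
library(doParallel)
library(topicmodels)

# toy stand-in for the real document-term matrix (20 docs x 10 terms)
set.seed(1)
dtm <- matrix(rpois(20 * 10, lambda = 2), nrow = 20,
              dimnames = list(NULL, paste0("term", 1:10)))

cl <- makePSOCKcluster(2)
registerDoParallel(cl)

ks <- 2:4  # 2:30 in the real run
models <- foreach(k = ks, .packages = "topicmodels") %dopar% {
  # the value of the block is what foreach collects, one element per k
  LDA(dtm, k, method = "Gibbs",
      control = list(seed = 1, burnin = 100, iter = 100))
}
stopCluster(cl)

names(models) <- paste("ldaOut", ks, sep = "_")
```

Note that foreach automatically exports variables such as dtm from the calling environment, so only the package needs to be named explicitly.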
Answer 1 (score: 0)
I think LDA is hard to run in parallel, because each sweep uses the results of the previous sweep.
So to speed things up, IMO you could:
- reduce your dtm
- use faster libraries, e.g. Vowpal Wabbit
- use faster hardware, e.g. AWS
If you are optimising over "hyperparameters" such as alpha, eta, burnin, etc., you can instead run a complete LDA on each core, each with different hyperparameters.
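That idea maps directly onto the same doParallel machinery: fix k, give each worker one hyperparameter value, and keep the fit that scores best. A sketch with a hypothetical alpha grid and a toy dtm standing in for the real one:

```r
library(doParallel)
library(topicmodels)

# toy stand-in for the real document-term matrix
set.seed(1)
dtm <- matrix(rpois(20 * 10, lambda = 2), nrow = 20,
              dimnames = list(NULL, paste0("term", 1:10)))

alphas <- c(0.1, 0.5, 1)  # hypothetical hyperparameter grid

cl <- makePSOCKcluster(length(alphas))
registerDoParallel(cl)
fits <- foreach(a = alphas, .packages = "topicmodels") %dopar% {
  # one complete LDA per worker, each with a different alpha
  LDA(dtm, k = 5, method = "Gibbs",
      control = list(alpha = a, seed = 1, burnin = 100, iter = 100))
}
stopCluster(cl)

# keep the run with, e.g., the highest log-likelihood
best.fit <- fits[[which.max(sapply(fits, logLik))]]
```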