I'm developing a classification tool against the DBpedia dumps, but I'm running into performance problems (processing takes several days). I want to classify every Wikipedia article; in short, I match each article against similar articles and merge the output of several classifiers. I'd like to make use of my PC (an i7-6700K) by running the program across multiple cores/processes, but I can't get it to work properly: I end up with multiple processes, yet only one of them runs at a time.
I'm running Ubuntu under the Windows Subsystem for Linux.
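For reference, here is a minimal sanity check (independent of my code) that I would expect to keep four workers busy on this machine; burn is just an arbitrary CPU-bound placeholder task:

import multiprocessing as mp
import os

def burn(n):
    # CPU-bound busy work so each worker actually occupies a core
    total = 0
    for i in range(n):
        total += i * i
    return os.getpid()

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        pids = pool.map(burn, [10_000_000] * 8)
    # four distinct worker PIDs would confirm work is spread across processes
    print(sorted(set(pids)))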
The input file looks like this:
# started 2016-06-16T01:23:53Z
<http://dbpedia.org/resource/Achilles> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/An_American_in_Paris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Actrius> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .
<http://dbpedia.org/resource/Animalia_(book)> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Book> .
<http://dbpedia.org/resource/Agricultural_science> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Alain_Connes> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Scientist> .
<http://dbpedia.org/resource/Allan_Dwan> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
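Each non-comment line is one N-Triples statement (subject, predicate, object, terminated by a dot); stripping the angle brackets from the subject gives the resource name. A sketch of that decomposition (the variable names here are mine, not from the tool):

line = (b'<http://dbpedia.org/resource/Actrius> '
        b'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        b'<http://dbpedia.org/ontology/Film> .')
subject, predicate, obj = line.split(b' ')[:3]
resource = subject[1:-1].decode('utf-8')
print(resource)  # http://dbpedia.org/resource/Actrius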
Here is my code:
import logging
import multiprocessing as mp
import os
from re import search as regSearch

logger = logging.getLogger(__name__)

def processWrapper(self, resourceInput, chunkStart, chunkSize):
    results = []
    with open(resourceInput, 'rb') as f:
        # jump to this worker's byte range and read it in one go
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        logger.info(lines)
        for line in lines:
            # skip comment lines such as the "# started ..." header
            if line.strip().startswith(b"#"):
                continue
            x = line.strip().split(b" ")
            logger.info(line.strip())
            # subject URI without its angle brackets
            resource = x[0][1:-1].decode("utf-8")
            # skip intermediate resources such as Foo__1
            if regSearch(r"__\d+", resource):
                continue
            logger.info(resource)
            result = self.resultTresholding(self.aggregateClassfierResults(resource), self.tresholdStrategy, self.threshold)
            finalType = self.finalTypeSelection(self.finalTypeSelectionStrategy, self.tresholdStrategy, result)
            results.append((result, finalType))
    return results
def chunkify(self, fname, size=1024*1024):
    fileEnd = os.path.getsize(fname)
    print(fileEnd)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            # skip roughly `size` bytes ahead, then read to the next
            # newline so a chunk never splits a line
            f.seek(size, 1)
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break
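To show what chunkify yields, here is how I understand it behaving on a toy file (self-contained sketch with chunkify copied out as a free function and a deliberately tiny size to force several chunks):

import os, tempfile

def chunkify(fname, size=16):
    fileEnd = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            # skip ahead, then finish the current line so no line is split
            f.seek(size, 1)
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

with tempfile.NamedTemporaryFile('wb', delete=False, suffix='.nt') as tmp:
    tmp.write(b''.join(b'line %d\n' % i for i in range(10)))

for start, length in chunkify(tmp.name):
    with open(tmp.name, 'rb') as f:
        f.seek(start)
        # every chunk starts at a line boundary and ends after a newline
        print((start, length), f.read(length).splitlines())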
def processResources(self, resourceInput):
    pool = mp.Pool(4)
    # graph = rdflib.Graph()
    jobs = []
    # create jobs, one per chunk
    for chunkStart, chunkSize in self.chunkify(str(resourceInput), 100):
        logger.info(f"chunkStart - {chunkStart}")
        logger.info(f"chunkSize - {chunkSize}")
        jobs.append(pool.apply_async(self.processWrapper, (resourceInput, chunkStart, chunkSize)))
    # collect the results and append them to the output file
    for job in jobs:
        for result in job.get():
            with open(s.resultFile, 'a') as g:
                g.write(f"<{result[0]}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{result[1]}> .\n")
    pool.close()
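For completeness, the pipeline does run end to end when I drive it serially, e.g. exercising a single chunk directly (hypothetical driver; ClassificationTool and the filename are placeholders for my actual class and input):

tool = ClassificationTool()
chunks = list(tool.chunkify('instance_types_en.ttl', 1024 * 1024))
first = tool.processWrapper('instance_types_en.ttl', *chunks[0])
print(len(first))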
What am I missing? This is my first time using multiprocessing.
...