I'm developing a classification tool against the DBpedia dumps, but I'm running into performance problems (processing takes several days). I want to classify every Wikipedia article; in short, I match each article against similar articles and merge the output of several classifiers. I'd like to make use of my PC (an i7-6700K) by running the program across multiple cores/processes, but I can't get it to work properly: I end up with multiple processes, yet only one of them runs at a time.
I'm running Ubuntu under the Windows Subsystem for Linux.
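For reference, here is a minimal sanity check (independent of my code) that I would expect to keep four workers busy on this machine; burn is just an arbitrary CPU-bound placeholder task:

import multiprocessing as mp
import os

def burn(n):
    # CPU-bound busy work so each worker actually occupies a core
    total = 0
    for i in range(n):
        total += i * i
    return os.getpid()

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        pids = pool.map(burn, [10_000_000] * 8)
    # four distinct worker PIDs would confirm work is spread across processes
    print(sorted(set(pids)))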
The input file looks like this:
# started 2016-06-16T01:23:53Z
<http://dbpedia.org/resource/Achilles> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/An_American_in_Paris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Actrius> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .
<http://dbpedia.org/resource/Animalia_(book)> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Book> .
<http://dbpedia.org/resource/Agricultural_science> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Alain_Connes> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Scientist> .
<http://dbpedia.org/resource/Allan_Dwan> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
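Each non-comment line is one N-Triples statement (subject, predicate, object, terminated by a dot); stripping the angle brackets from the subject gives the resource name. A sketch of that decomposition (the variable names here are mine, not from the tool):

line = (b'<http://dbpedia.org/resource/Actrius> '
        b'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        b'<http://dbpedia.org/ontology/Film> .')
subject, predicate, obj = line.split(b' ')[:3]
resource = subject[1:-1].decode('utf-8')
print(resource)  # http://dbpedia.org/resource/Actrius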
Here is my code:
import logging
import multiprocessing as mp
import os
from re import search as regSearch

logger = logging.getLogger(__name__)

def processWrapper(self, resourceInput, chunkStart, chunkSize):
    results = []
    with open(resourceInput, 'rb') as f:
        # jump to this worker's byte range and read it in one go
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        logger.info(lines)
        for line in lines:
            # skip comment lines such as the "# started ..." header
            if line.strip().startswith(b"#"):
                continue
            x = line.strip().split(b" ")
            logger.info(line.strip())
            # subject URI without its angle brackets
            resource = x[0][1:-1].decode("utf-8")
            # skip intermediate resources such as Foo__1
            if regSearch(r"__\d+", resource):
                continue
            logger.info(resource)
            result = self.resultTresholding(self.aggregateClassfierResults(resource), self.tresholdStrategy, self.threshold)
            finalType = self.finalTypeSelection(self.finalTypeSelectionStrategy, self.tresholdStrategy, result)
            results.append((result, finalType))
    return results
def chunkify(self, fname, size=1024*1024):
    fileEnd = os.path.getsize(fname)
    print(fileEnd)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            # skip roughly `size` bytes ahead, then read to the next
            # newline so a chunk never splits a line
            f.seek(size, 1)
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break
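To show what chunkify yields, here is how I understand it behaving on a toy file (self-contained sketch with chunkify copied out as a free function and a deliberately tiny size to force several chunks):

import os, tempfile

def chunkify(fname, size=16):
    fileEnd = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            # skip ahead, then finish the current line so no line is split
            f.seek(size, 1)
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

with tempfile.NamedTemporaryFile('wb', delete=False, suffix='.nt') as tmp:
    tmp.write(b''.join(b'line %d\n' % i for i in range(10)))

for start, length in chunkify(tmp.name):
    with open(tmp.name, 'rb') as f:
        f.seek(start)
        # every chunk starts at a line boundary and ends after a newline
        print((start, length), f.read(length).splitlines())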
def processResources(self, resourceInput):
    pool = mp.Pool(4)
    # graph = rdflib.Graph()
    jobs = []
    # create jobs, one per chunk
    for chunkStart, chunkSize in self.chunkify(str(resourceInput), 100):
        logger.info(f"chunkStart - {chunkStart}")
        logger.info(f"chunkSize - {chunkSize}")
        jobs.append(pool.apply_async(self.processWrapper, (resourceInput, chunkStart, chunkSize)))
    # collect the results and append them to the output file
    for job in jobs:
        for result in job.get():
            with open(s.resultFile, 'a') as g:
                g.write(f"<{result[0]}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{result[1]}> .\n")
    pool.close()
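For completeness, the pipeline does run end to end when I drive it serially, e.g. exercising a single chunk directly (hypothetical driver; ClassificationTool and the filename are placeholders for my actual class and input):

tool = ClassificationTool()
chunks = list(tool.chunkify('instance_types_en.ttl', 1024 * 1024))
first = tool.processWrapper('instance_types_en.ttl', *chunks[0])
print(len(first))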
What am I missing? This is my first time using multiprocessing.
...