I'm performing natural language processing with NLTK on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seems to be what I'm after, and when I run the following test code I see all cores being utilized, but the code never completes.
Executing the same task without multiprocessing finishes in roughly one minute.
Python 2.7.11 on Debian.
from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp
def open_file(filepath):
    # open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    # word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens
filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
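Note that text_to_process is a single string, so pool.map iterates it character by character and dispatches one task per character, which is very likely why the pooled version never finishes in a reasonable time. A minimal sketch of a line-wise variant (assuming the input file has one sentence per line) could look like:

# Sketch only: give each worker a whole line instead of a single character.
lines = text.split('\n')
pool = mp.Pool(processes=8)
word_tokens_per_line = pool.map(word_tokenize, lines)
pool.close()
pool.join()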
Answer 0 (score: 3)
This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.
Here's the cheater's way of doing multi-threading with sframe:
>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>>
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>>
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize(): 4.058445692062378
word_tokenize(): 4.05820369720459
word_tokenize(): 4.090051174163818
word_tokenize(): 4.210559129714966
word_tokenize(): 4.17473030090332
word_tokenize(): 4.105806589126587
word_tokenize(): 4.082665681838989
word_tokenize(): 4.13646936416626
word_tokenize(): 4.185062408447266
word_tokenize(): 4.085020065307617
>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe: 7.174573659896851
word_tokenize() with sframe: 5.072867393493652
word_tokenize() with sframe: 5.129574775695801
word_tokenize() with sframe: 5.10952091217041
word_tokenize() with sframe: 5.015898942947388
word_tokenize() with sframe: 5.037845611572266
word_tokenize() with sframe: 5.015375852584839
word_tokenize() with sframe: 5.016635894775391
word_tokenize() with sframe: 5.155989170074463
word_tokenize() with sframe: 5.132697105407715
>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('str.split():\t', time.time() - start)
... 
str.split(): 4.176181793212891
str.split(): 4.116339921951294
str.split(): 4.1104896068573
str.split(): 4.140819549560547
str.split(): 4.103625774383545
str.split(): 4.125757694244385
str.split(): 4.10755729675293
str.split(): 4.177418947219849
str.split(): 4.11145281791687
str.split(): 4.140623092651367
Note that the speed difference might be because I had other things running on the other cores. But given a much larger dataset and dedicated cores, you can really see this scale.
Answer 1 (score: 0)
It's been a few years and SFrame has since become part of turicreate:
The speed gains are significant when using the new SFrame (in Python 3).
from nltk import word_tokenize
from turicreate import SFrame
import time
import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')
%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)
[out]:
word_tokenize(): 4.619681119918823
word_tokenize(): 4.666991233825684
word_tokenize(): 4.452856779098511
word_tokenize(): 4.574898958206177
word_tokenize(): 4.536381959915161
word_tokenize(): 4.522706031799316
word_tokenize(): 4.742286682128906
word_tokenize(): 4.894973039627075
word_tokenize(): 4.813692808151245
word_tokenize(): 4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s
sf = SFrame(data.split('\n'))
sf.materialize() # Reads data fully first
%%time
for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)
[out]:
word_tokenize() with sframe: 3.2141151428222656
word_tokenize() with sframe: 3.129708766937256
word_tokenize() with sframe: 3.415634870529175
word_tokenize() with sframe: 3.433109760284424
word_tokenize() with sframe: 3.2390329837799072
word_tokenize() with sframe: 3.236827850341797
word_tokenize() with sframe: 3.3200089931488037
word_tokenize() with sframe: 3.367327928543091
word_tokenize() with sframe: 4.476067066192627
word_tokenize() with sframe: 4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s
Note: SFrame is lazily evaluated; .materialize() forces the SFrame to persist to disk, committing all lazily evaluated operations.
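A small sketch of how that laziness can show up, reusing the sf built above (just an illustration of the behavior described, under the assumption that apply() defers the work):

tokens_sa = sf.apply(lambda row: word_tokenize(row['X1']))  # may return before any tokenization is done
tokens = list(tokens_sa)  # pulling the results out forces the deferred work to actually run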
Additionally, you can use the "embarrassingly simple" parallelization with joblib:
from joblib import Parallel, delayed
%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)
[out]:
word_tokenize() with joblib: 3.009906053543091
word_tokenize() with joblib: 4.92037296295166
word_tokenize() with joblib: 3.3748512268066406
word_tokenize() with joblib: 3.9530580043792725
word_tokenize() with joblib: 4.794445991516113
word_tokenize() with joblib: 3.7257909774780273
word_tokenize() with joblib: 4.811202049255371
word_tokenize() with joblib: 3.9719762802124023
word_tokenize() with joblib: 4.347040891647339
word_tokenize() with joblib: 3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s
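A plausible reason joblib barely beats the plain loop here is that each task tokenizes a single short line, so the dispatch overhead dominates. A rough sketch that packs more lines into each task via joblib's batch_size parameter (128 is an arbitrary, untuned value) would be:

from joblib import Parallel, delayed
from nltk import word_tokenize

# Larger batches amortize the per-task dispatch overhead of the worker pool.
# 128 is just a guess; tune it for the actual data.
x = Parallel(n_jobs=4, batch_size=128)(delayed(word_tokenize)(line) for line in lines)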