Python multiprocessing with NLTK word_tokenizer - function never completes

Asked: 2016-02-19 18:41:03

Tags: python multiprocessing nltk python-multiprocessing

I'm doing natural language processing with NLTK on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seems to be what I'm after, and when I run the test code below I can see all cores being utilized, but the code never completes.

Executing the same task without multiprocessing finishes in about a minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    #open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    #word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
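
One plausible culprit (an assumption, not confirmed anywhere in this thread): pool.map iterates over its second argument, and iterating over a single string yields one character at a time, so the pool is handed one tiny tokenization task per character. A minimal sketch of a line-based version under that assumption, reusing the pool size and file path from the code above:

import io
import multiprocessing as mp
from nltk.tokenize import word_tokenize

def tokenize_by_line(text):
    # Give each worker a whole line; mapping over the raw string would
    # instead hand the pool one single-character task per character.
    lines = text.splitlines()
    pool = mp.Pool(processes=8)
    try:
        return pool.map(word_tokenize, lines)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':  # keeps worker processes from re-running the module body
    with io.open("./p40_compiled.txt", encoding='utf-8') as f:
        tokenized_text = tokenize_by_line(f.read())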

2 Answers:

Answer 0 (score: 3):

DEPRECATED

This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.

Here's the cheater's way to do multithreading with sframe:

>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>> 
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>> 
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize():     4.058445692062378
word_tokenize():     4.05820369720459
word_tokenize():     4.090051174163818
word_tokenize():     4.210559129714966
word_tokenize():     4.17473030090332
word_tokenize():     4.105806589126587
word_tokenize():     4.082665681838989
word_tokenize():     4.13646936416626
word_tokenize():     4.185062408447266
word_tokenize():     4.085020065307617

>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe:     7.174573659896851
word_tokenize() with sframe:     5.072867393493652
word_tokenize() with sframe:     5.129574775695801
word_tokenize() with sframe:     5.10952091217041
word_tokenize() with sframe:     5.015898942947388
word_tokenize() with sframe:     5.037845611572266
word_tokenize() with sframe:     5.015375852584839
word_tokenize() with sframe:     5.016635894775391
word_tokenize() with sframe:     5.155989170074463
word_tokenize() with sframe:     5.132697105407715

>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('str.split():\t', time.time() - start)
... 
str.split():     4.176181793212891
str.split():     4.116339921951294
str.split():     4.1104896068573
str.split():     4.140819549560547
str.split():     4.103625774383545
str.split():     4.125757694244385
str.split():     4.10755729675293
str.split():     4.177418947219849
str.split():     4.11145281791687
str.split():     4.140623092651367

Note that the speed difference may be because I had other things running on the other cores, but given a larger dataset and dedicated cores, you would really see this scale.

Answer 1 (score: 0):

It's been a few years, and SFrame has since become part of turicreate:

The speed-up from using the new SFrame (in Python 3) is significant.

In native Python and NLTK:

from nltk import word_tokenize
from turicreate import SFrame

import time

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')

%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)

[Output]:

word_tokenize():     4.619681119918823
word_tokenize():     4.666991233825684
word_tokenize():     4.452856779098511
word_tokenize():     4.574898958206177
word_tokenize():     4.536381959915161
word_tokenize():     4.522706031799316
word_tokenize():     4.742286682128906
word_tokenize():     4.894973039627075
word_tokenize():     4.813692808151245
word_tokenize():     4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s

Using SFrame

sf = SFrame(data.split('\n'))
sf.materialize() # Reads data fully first

%%time

for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)

[Output]:

word_tokenize() with sframe:     3.2141151428222656
word_tokenize() with sframe:     3.129708766937256
word_tokenize() with sframe:     3.415634870529175
word_tokenize() with sframe:     3.433109760284424
word_tokenize() with sframe:     3.2390329837799072
word_tokenize() with sframe:     3.236827850341797
word_tokenize() with sframe:     3.3200089931488037
word_tokenize() with sframe:     3.367327928543091
word_tokenize() with sframe:     4.476067066192627
word_tokenize() with sframe:     4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s

Note: SFrame is lazily evaluated; .materialize() forces the SFrame to be persisted to disk, committing all lazily evaluated operations.
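
A small illustration of that laziness (a sketch, assuming the SArray returned by apply() also exposes materialize(); sf, word_tokenize and time as defined above):

lazy = sf.apply(lambda x: word_tokenize(x['X1']))  # returns almost immediately: nothing is computed yet
start = time.time()
lazy.materialize()                                 # the tokenization actually runs here
print('materialize() took', time.time() - start, 'seconds')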


Using Joblib

Additionally, you can use joblib's "embarrassingly simple" parallelization:

from joblib import Parallel, delayed

%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)

[Output]:

word_tokenize() with joblib:     3.009906053543091
word_tokenize() with joblib:     4.92037296295166
word_tokenize() with joblib:     3.3748512268066406
word_tokenize() with joblib:     3.9530580043792725
word_tokenize() with joblib:     4.794445991516113
word_tokenize() with joblib:     3.7257909774780273
word_tokenize() with joblib:     4.811202049255371
word_tokenize() with joblib:     3.9719762802124023
word_tokenize() with joblib:     4.347040891647339
word_tokenize() with joblib:     3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s
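
One tweak worth trying (a sketch, not taken from the answer above): with one delayed() call per line, serialization overhead can eat into the gains, so handing each worker a chunk of lines may scale better. The chunk size is an arbitrary illustrative value, and lines is the list built earlier:

from joblib import Parallel, delayed
from nltk import word_tokenize

def tokenize_chunk(chunk):
    # tokenize a whole chunk of lines inside a single task
    return [word_tokenize(line) for line in chunk]

chunk_size = 1000  # arbitrary illustrative value
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
nested = Parallel(n_jobs=4)(delayed(tokenize_chunk)(c) for c in chunks)
x = [tokens for chunk in nested for tokens in chunk]  # flatten back to one token list per line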