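How can I use multiprocessing in a Python for loop to pre-process a pandas DataFrame?

Time: 2019-09-10 05:35:35

Tags: python pandas nlp multiprocessing data-cleaning

I have a dataset of 8,500 rows of text. I want to apply a function, pre_process, to each of these rows. When I do this serially, it takes about 42 minutes on my machine: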

import pandas as pd
import time
import re

### constructing a sample dataframe of 10 rows to demonstrate
df = pd.DataFrame(columns=['text'])
df.text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

def pre_process(text):
    '''
    function to pre-process and clean text
    '''
    stop_words = ['in', 'of', 'at', 'a', 'the']

    # lowercase
    text=str(text).lower()

    # remove special characters except spaces, apostrophes and dots
    text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text)

    # remove stopwords
    text=[word for word in text.split(' ') if word not in stop_words]

    return ' '.join(text)

t = time.time()
for i in range(len(df)):
    df.text[i] = pre_process(df.text[i])

print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

>>> Time taken for pre-processing the data = 41.95724259614944 mins

So I want to use multiprocessing for this task. I got help from here and wrote the following code:

import pandas as pd
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

def func(text):
    return pre_process(text)

t = time.time()
results = pool.map(func, [df.text[i] for i in range(len(df))])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

pool.close()

But the code just keeps running and never stops.

How can I make it work?

2 Answers:

Answer 0: (score: 1)

You can use pandas.DataFrame.apply.
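As a minimal sketch of this suggestion (not necessarily the exact code the answer had in mind), the per-row loop from the question can be replaced with a single apply call on the column. Since a single column is a Series, this uses Series.apply. The two-row DataFrame here is only a stand-in for the real dataset, and pre_process is copied from the question.

```python
import re

import pandas as pd


def pre_process(text):
    """Pre-process and clean text (same logic as in the question)."""
    stop_words = ['in', 'of', 'at', 'a', 'the']
    text = str(text).lower()
    # replace every run of characters other than letters, digits, dots
    # and apostrophes with a single space
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(word for word in text.split(' ') if word not in stop_words)


# Stand-in data: two rows taken from the question's sample.
df = pd.DataFrame({'text': [
    'Yet the act is still charming here .',
    'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
]})

# Apply the function to every element of the column in one call.
df['text'] = df['text'].apply(pre_process)
print(df['text'].tolist())
```

This still runs in a single process, so it mainly avoids the repeated df.text[i] lookups and assignments rather than adding parallelism.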


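Answer 1: (score: 1)

The code below works for me, though. I pass pre_process to the pool directly instead of wrapping it in func. I also use a context manager (a with statement) for the pool.

Below is the code, run in IPython.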

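In [1]: from multiprocessing import Pool, TimeoutError
    ...: import multiprocessing as mp  # needed for mp.cpu_count() below
    ...: import re                     # needed by pre_process
    ...: import time
    ...: import os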

In [2]: text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to 
    ...: make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
    ...:  
    ...:  "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a
    ...:  column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision
    ...:  of J.R.R. Tolkien 's Middle-earth .", 
    ...:  'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more s
    ...: imply intrusive to the story -- but the whole package certainly captures the intended , er , spi
    ...: rit of the piece .', 
    ...:  "You 'd think by now America would have had enough of plucky British eccentrics with hearts of 
    ...: gold .", 
    ...:  'Yet the act is still charming here .', 
    ...:  "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the
    ...:  self , '' Derrida is an undeniably fascinating and playful fellow .", 
    ...:  'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro o
    ...: f madness and light is astonishing .', 
    ...:  'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .', 
    ...:  "a screenplay more ingeniously constructed than `` Memento ''", 
    ...:  "`` Extreme Ops '' exceeds expectations ."]                       

In [3]: def pre_process(text): 
    ...:     ''' 
    ...:     function to pre-process and clean text 
    ...:     ''' 
    ...:     stop_words = ['in', 'of', 'at', 'a', 'the'] 
    ...:  
    ...:     # lowercase 
    ...:     text=str(text).lower() 
    ...:  
    ...:     # remove special characters except spaces, apostrophes and dots 
    ...:     text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text) 
    ...:  
    ...:     # remove stopwords 
    ...:     text=[word for word in text.split(' ') if word not in stop_words] 
    ...:  
    ...:     return ' '.join(text) 


In [4]: %%time 
    ...: result = [] 
    ...: for x in text: 
    ...:     result.append(pre_process(x)) 
    ...:  
    ...:                                                                                                 
CPU times: user 500 µs, sys: 59 µs, total: 559 µs
Wall time: 569 µs

In [5]: %%time 
    ...: with Pool(mp.cpu_count()) as pool: 
    ...:     results = pool.map(pre_process, text) 
    ...:  
    ...:                                                                                          
CPU times: user 4.58 ms, sys: 29 ms, total: 33.6 ms
Wall time: 137 ms

In [6]: results                                                                                        
Out[6]: 
["rock is destined to be 21st century 's new conan '' and that he 's going to make splash even greater than arnold schwarzenegger jean claud van damme or steven segal .",
 "gorgeously elaborate continuation lord rings '' trilogy is so huge that column words can not adequately describe co writer director peter jackson 's expanded vision j.r.r. tolkien 's middle earth .",
 'singer composer bryan adams contributes slew songs few potential hits few more simply intrusive to story but whole package certainly captures intended er spirit piece .',
 "you 'd think by now america would have had enough plucky british eccentrics with hearts gold .",
 'yet act is still charming here .',
 "whether or not you 're enlightened by any derrida 's lectures on other '' and self '' derrida is an undeniably fascinating and playful fellow .",
 'just labour involved creating layered richness imagery this chiaroscuro madness and light is astonishing .',
 'part charm satin rouge is that it avoids obvious with humour and lightness .',
 "screenplay more ingeniously constructed than memento ''",
 " extreme ops '' exceeds expectations ."]

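%%time is an IPython magic that measures the execution time of a cell. Of course, with such a small sample, multiprocessing actually runs slower because of the overhead of creating new processes.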
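Since %%time only exists in IPython, a similar measurement in a plain Python script could be sketched with time.perf_counter. The 1,000-row list below is made-up stand-in data, and the timing will vary by machine, so no expected figure is given.

```python
import re
import time


def pre_process(text):
    """Pre-process and clean text (same logic as in the question)."""
    stop_words = ['in', 'of', 'at', 'a', 'the']
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(word for word in text.split(' ') if word not in stop_words)


# Stand-in data: one sample sentence repeated 1,000 times.
texts = ['Yet the act is still charming here .'] * 1000

start = time.perf_counter()
results = [pre_process(t) for t in texts]
elapsed = time.perf_counter() - start

print('Serial pre-processing of {} rows took {:.6f} s'.format(len(texts), elapsed))
```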

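In any case, you can convert a pandas.DataFrame column (a Series) to a list with list(), as shown below, instead of iterating over it by index, which is more efficient.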

list(df.text)
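Putting the pieces together, a standalone script might look like the sketch below. It rests on a few assumptions that are not confirmed by the question: that the hang comes from creating the pool before the worker function is defined, or from a missing `if __name__ == '__main__':` guard on platforms that start workers with spawn, such as Windows. The helper name parallel_pre_process and the chunksize value are illustrative choices, not part of the original code.

```python
import multiprocessing as mp
import re

import pandas as pd


def pre_process(text):
    """Pre-process and clean text (same logic as in the question)."""
    stop_words = ['in', 'of', 'at', 'a', 'the']
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(word for word in text.split(' ') if word not in stop_words)


def parallel_pre_process(texts, processes=None, chunksize=1000):
    """Hypothetical helper: apply pre_process to each string using a pool.

    The pool is created only after pre_process is defined at module level,
    so worker processes can find the function by name.
    """
    with mp.Pool(processes) as pool:
        # chunksize batches the rows so each task sent to a worker carries
        # many strings rather than one.
        return pool.map(pre_process, texts, chunksize=chunksize)


if __name__ == '__main__':
    # Stand-in data: two rows taken from the question's sample.
    df = pd.DataFrame({'text': [
        'Yet the act is still charming here .',
        "`` Extreme Ops '' exceeds expectations .",
    ]})
    # pool.map preserves input order, so the result lines up with the rows.
    df['text'] = parallel_pre_process(list(df.text), processes=2)
    print(df)
```

In a notebook or interactive session, the same ordering applies: define pre_process first, then create the pool.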

Below is a performance comparison between using list() and iterating by index as you did. The totals are 197 µs versus 564 µs.

In [52]: %%time 
    ...: [s[i] for i in range(len(s))] 
    ...:  
    ...:                                                                                                
CPU times: user 499 µs, sys: 65 µs, total: 564 µs
Wall time: 506 µs
Out[52]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

In [53]: %%time 
    ...: list(s) 
    ...:  
    ...:                                                                                                
CPU times: user 174 µs, sys: 23 µs, total: 197 µs
Wall time: 215 µs
Out[53]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]