I'm doing some NLP and need to clean my data. I wrote three functions to 1) clean the data, 2) check whether the data is on topic, and 3) check whether the data is in English.
I have 8 million rows of data, and most of the computations don't depend on one another. I'm considering using Pool to parallelize the code, but I'm not sure that's wise, since all the data lives in a Pandas DataFrame (and I know numba doesn't play well with DataFrames).
Can I parallelize my code with Pool? Is it as simple as the example code I found in the documentation? Is Pool even the right library for this?
It should be noted that I'm running this on Mac OS X. Here is my code for reference:
import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
import enchant
import numpy as np

sf = pd.read_csv('timeandtweet.csv')

# Build the stopword set once up front; recomputing stopwords.words("english")
# for every word of every tweet is extremely slow.
STOPWORDS = set(stopwords.words("english") + [u'rt'])

def clean_tweet(x):
    # Strip HTML, keep letters only, lower-case, and drop stopwords.
    cleaning = BeautifulSoup(x, "lxml")
    letters_only = re.sub("[^a-zA-Z]", " ", cleaning.get_text())
    words = letters_only.lower().split()
    words = [w for w in words if w not in STOPWORDS]
    return " ".join(words)

def on_topic(x):
    topics = [u'measles', u'mmr', u'vaccine', u'vaccines']
    if any(j in topics for j in x.split()):
        return 1
    else:
        return -1

def is_english(x):
    # Call a tweet English if at least 60% of its words pass the en_US spell check.
    lang = enchant.Dict('en_US')
    L = len(x.split())
    words = []
    for i in x.split():
        words.append(lang.check(i))
    if float(sum(words)) / L < 0.6:
        return -1
    else:
        return 1

sf['Clean Tweet'] = np.zeros_like(sf.Tweet)
sf['English-Topic'] = np.zeros_like(sf.Tweet)
for i in xrange(len(sf)):  # Loop instead of df.apply for speed?
    if (i + 1) % 1000 == 0:
        print "Review %d of %d\n" % (i + 1, len(sf))
    sf['Clean Tweet'][i] = clean_tweet(sf.Tweet[i])
    # Each cell holds a (topic, english) tuple of +1/-1 flags.
    sf['English-Topic'][i] = (on_topic(sf['Clean Tweet'][i]), is_english(sf['Clean Tweet'][i]))
sf.to_csv('cleaned_processed.csv', index=False)
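The pattern I had in mind, based on examples I've seen of pairing Pool with pandas, splits the frame into one chunk per core and lets each worker run the whole pipeline on its chunk. This is just an untested sketch of the idea, and process_chunk is a hypothetical helper I made up, not something from my code above:

from multiprocessing import Pool, cpu_count

def process_chunk(chunk):
    # Hypothetical helper: run the full pipeline on one piece of the frame.
    chunk = chunk.copy()  # avoid SettingWithCopy issues on the split pieces
    chunk['Clean Tweet'] = chunk['Tweet'].apply(clean_tweet)
    chunk['English-Topic'] = [(on_topic(t), is_english(t))
                              for t in chunk['Clean Tweet']]
    return chunk

pool = Pool(cpu_count())
chunks = np.array_split(sf, cpu_count())   # one chunk per core
sf = pd.concat(pool.map(process_chunk, chunks))
pool.close()
pool.join()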
Here is how I actually tried to parallelize it:
sf['Clean Tweet'] = np.zeros_like(sf.Tweet)
sf['English-Topic'] = np.zeros_like(sf.Tweet)
from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(clean_tweet,[sf.Tweet])
answer1 = result1.get()
But I keep getting a ValueError:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
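If I'm reading the error right, apply_async with [sf.Tweet] hands the entire Series to clean_tweet as a single argument, so the function ends up doing string operations against a whole array of tweets instead of one string, which is presumably what triggers the ambiguous-truth-value complaint. Would pool.map, which applies the function to each element of the iterable, be the right call instead? A minimal untested sketch of what I mean:

from multiprocessing import Pool

pool = Pool()  # defaults to one worker process per core
# map() applies clean_tweet to each individual tweet, in parallel
sf['Clean Tweet'] = pool.map(clean_tweet, sf.Tweet)
pool.close()
pool.join()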