I have a simple dataframe with two columns.
+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
| cool    | 0     |
| hey     | 0     |
| there   | 0     |
| come on | 0     |
| welcome | 0     |
+---------+-------+
For every record in the 'subject' column, I call a function and store the result in the 'score' column:
df['score'] = df['subject'].apply(find_score)
Here, find_score is a function that processes a string and returns a score:
def find_score(row):
    # Imports the Google Cloud client library
    from google.cloud import language
    # Instantiates a client
    language_client = language.Client()
    import re
    pre_text = re.sub(r'<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
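This works as expected, but it is slow because it processes the records one at a time.
Is there a way to parallelize this without manually splitting the dataframe into smaller chunks? Is there a library that does this automatically?
Cheers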
Answer 0 (score: 4)
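Instantiating language.Client every time find_score is called is likely a major bottleneck. You don't need a new client instance for each call, so try creating it once, outside the function, before you call it: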
import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, outside the function, so every call reuses it
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace non-word characters with spaces
    pre_text = re.sub(r'<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score

df['score'] = df['subject'].apply(find_score)
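If a module-level global is awkward (for example, when the function is later run in worker processes that each need their own client), one possible variation is to create the client lazily and cache it. This is only a sketch of the "create once, reuse" idea: FakeClient and its score method are hypothetical stand-ins so the snippet runs without credentials, and they are not part of the Google library.

```python
from functools import lru_cache


class FakeClient:
    """Hypothetical stand-in for the real client; counts how often it is built."""
    instances = 0

    def __init__(self):
        FakeClient.instances += 1

    def score(self, text):
        # Dummy value (text length), not a real sentiment score.
        return float(len(text))


@lru_cache(maxsize=None)
def get_client():
    # The body runs only on the first call; later calls get the cached object.
    return FakeClient()


def find_score(text):
    return get_client().score(text)


scores = [find_score(s) for s in ['wow', 'cool', 'hey']]
print(scores, FakeClient.instances)
```

In real code, get_client would presumably return language.Client() instead of the stand-in.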
If you insist, you can use multiprocessing like this:
from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
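Since each call mostly waits on a network request, a thread pool may be enough and avoids pickling data between processes. Below is a minimal sketch using the standard library's concurrent.futures. The find_score here is a hypothetical stand-in that cleans the text with the same regexes, sleeps to mimic latency, and returns a dummy number, and a plain list stands in for df['subject']. Whether the real client can safely be shared across threads is an assumption you would need to check.

```python
from concurrent.futures import ThreadPoolExecutor
import re
import time


def find_score(text):
    # Hypothetical stand-in for the real API call: cleans the text, sleeps
    # briefly to mimic network latency, and returns a dummy score (the
    # cleaned length), not a real sentiment value.
    pre_text = re.sub(r'<[^>]*>', '', text)
    cleaned = re.sub(r'[^\w]', ' ', pre_text)
    time.sleep(0.05)
    return float(len(cleaned))


subjects = ['wow', 'cool', 'hey', 'there', 'come on', 'welcome']

# executor.map yields results in input order, so they line up with the rows.
with ThreadPoolExecutor(max_workers=8) as executor:
    scores = list(executor.map(find_score, subjects))

print(scores)
```

With the dataframe from the question, the equivalent assignment would be df['score'] = list(executor.map(find_score, df['subject'])).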