I have a pandas dataframe with several hundred thousand rows and a column df['reviews'] containing text reviews of products. I am cleaning the data, but the preprocessing takes a very long time. Could you give me some advice on how to optimize my code? Thanks in advance.
# import useful libraries
import pandas as pd
from langdetect import detect
import nltk
from html2text import unescape
from nltk.corpus import stopwords
# define corpus
words = set(nltk.corpus.words.words())
# define stopwords
stop = stopwords.words('english')
newStopWords = ['oz','stopWord2']
stop.extend(newStopWords)
# read csv into dataframe
df=pd.read_csv('./data.csv')
# unescape reviews (fix html encoding)
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
# remove non-ASCII characters
df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
# calculate number of stop words in raw reviews
df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
# lowercase reviews
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# add a space before and after every punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
# remove punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
# remove stopwords
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# remove digits
df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
# remove non-corpus words
def remove_noncorpus(sentence):
    print(sentence)
    return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
# count number of characters
df['character_count'] = df['clean_reviews'].apply(len)
# count number of words
df['word_count'] = df['clean_reviews'].str.split().str.len()
# average word length
def avg_word(sentence):
    words = sentence.split()
    print(sentence)
    return (sum(len(word) for word in words)/len(words))
df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
df[['clean_reviews','avg_word']].head()
# detect language of reviews
df['language'] = df['clean_reviews'].apply(detect)
# filter out non-English reviews
msk = (df['language'] == 'en')
df_range = df[msk]
# write dataframe to csv
df_range.to_csv('dataclean.csv', index=False)
The code posted above does everything I need; however, it takes several hours to finish. I would appreciate any helpful suggestions on how to reduce the processing time. Please let me know if you need any further details.
Answer 0 (score: 3)
First you have to find out where your program spends most of its time. As mentioned in the comments above, this can be done "manually" by inserting print()
calls after each step to get a visual feel for the program's progress. For quantitative results, you can wrap each step in start = time.time()
and print('myProgramStep: {}'.format(time.time() - start))
calls. That works as long as your program is relatively short; otherwise it becomes quite tedious.
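For illustration, a minimal sketch of this manual timing, wrapped around one of the preprocessing steps above (the choice of step is arbitrary):

import time

start = time.time()
# time a single preprocessing step, here the lowercasing pass
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(w.lower() for w in x.split()))
print('lowercase step: {:.2f} s'.format(time.time() - start))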
The better way is to use a profiler. Python comes with a built-in profiler, but it is a bit cumbersome to use: first we profile the program with cProfile
, then we load the profile with pstats
for inspection:
python3 -m cProfile -o so57333255.py.prof so57333255.py
python3 -m pstats so57333255.py.prof
Inside pstats
we type sort cumtime
to sort by the time spent in a function plus all the functions it calls, and stats 5
to show the top 5 entries:
2351652 function calls (2335973 primitive calls) in 9.843 seconds
Ordered by: cumulative time
List reduced from 4964 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1373/1 0.145 0.000 9.852 9.852 {built-in method exec}
1 0.079 0.079 9.852 9.852 so57333255.py:2(<module>)
9 0.003 0.000 5.592 0.621 {pandas._libs.lib.map_infer}
8 0.001 0.000 5.582 0.698 /usr/local/lib/python3.4/dist-packages/pandas/core/series.py:2230(apply)
100 0.001 0.000 5.341 0.053 /usr/local/lib/python3.4/dist-packages/langdetect/detector_factory.py:126(detect)
From this we learn that the single most expensive function in the program is apply
, which is called 8 times - but it does not tell us whether those 8 calls each took roughly the same time or whether one of them took particularly long. On the next line, however, we see detect
with 5.341 seconds, i.e. almost all of the 5.582 seconds of the 8 apply
calls were spent in apply(detect)
. You can gain further insight with the callers
and callees
commands, but as you can see it is not very convenient.
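For reference, the same top-5 listing can also be produced without the interactive pstats session; a minimal sketch, assuming the profile file generated by the cProfile command above:

import pstats

# load the saved profile and print the 5 most expensive entries by cumulative time
stats = pstats.Stats('so57333255.py.prof')
stats.sort_stats('cumtime').print_stats(5)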
A much more user-friendly approach is line profiler. It uses a @profile
decorator to profile calls to a function, so we have to put the whole program into a function carrying that decorator and then call it; running the script under line profiler then yields the results shown after the short sketch below.
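A minimal sketch of that setup; the file name so57333255a.py is taken from the listing below, and kernprof is the runner that ships with line_profiler (when invoked with -l it makes @profile available as a builtin, so no import is needed):

# so57333255a.py -- skeleton only; the body is the preprocessing code from the question
@profile          # injected by "kernprof -l" at run time
def runit():
    # ... all the steps from the original script go inside this function ...
    pass

runit()

# run and show line-by-line timings:
#   kernprof -l -v so57333255a.py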
Total time: 8.59578 s
File: so57333255a.py
Function: runit at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def runit():
10
11 # define corpus
12 1 385710.0 385710.0 4.5 words = set(nltk.corpus.words.words())
13
14 # define stopwords
15 1 2068.0 2068.0 0.0 stop = stopwords.words('english')
16 1 10.0 10.0 0.0 newStopWords = ['oz','stopWord2']
17 1 9.0 9.0 0.0 stop.extend(newStopWords)
18
19 # read csv into dataframe
20 1 46880.0 46880.0 0.5 df=pd.read_csv('reviews.csv', names=['reviews'], header=None, nrows=100)
21
22 # unescape reviews (fix html encoding)
23 1 16922.0 16922.0 0.2 df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
24
25 # remove non-ASCII characters
26 1 15133.0 15133.0 0.2 df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
27
28 # calculate number of stop words in raw reviews
29 1 20721.0 20721.0 0.2 df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
30
31 # lowercase reviews
32 1 5325.0 5325.0 0.1 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
33
34 # add a space before and after every punctuation
35 1 9834.0 9834.0 0.1 df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
36
37 # remove punctuation
38 1 3262.0 3262.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
39
40 # remove stopwords
41 1 20259.0 20259.0 0.2 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
42
43 # remove digits
44 1 2897.0 2897.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
45
46 # remove non-corpus words
47 1 9.0 9.0 0.0 def remove_noncorpus(sentence):
48 #print(sentence)
49 return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
50
51 1 6698.0 6698.0 0.1 df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
52
53 # count number of characters
54 1 1912.0 1912.0 0.0 df['character_count'] = df['clean_reviews'].apply(len)
55
56 # count number of words
57 1 3641.0 3641.0 0.0 df['word_count'] = df['clean_reviews'].str.split().str.len()
58
59 # average word length
60 1 9.0 9.0 0.0 def avg_word(sentence):
61 words = sentence.split()
62 #print(sentence)
63 return (sum(len(word) for word in words)/len(words)) if len(words)>0 else 0
64
65 1 3445.0 3445.0 0.0 df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
66 1 3786.0 3786.0 0.0 df[['clean_reviews','avg_word']].head()
67
68 # detect language of reviews
69 1 8037362.0 8037362.0 93.5 df['language'] = df['clean_reviews'].apply(detect)
70
71 # filter out non-English reviews
72 1 1453.0 1453.0 0.0 msk = (df['language'] == 'en')
73 1 2353.0 2353.0 0.0 df_range = df[msk]
74
75 # write dataframe to csv
76 1 6087.0 6087.0 0.1 df_range.to_csv('dataclean.csv', index=False)
Here we can see directly that 93.5% of the total time is spent in df['language'] = df['clean_reviews'].apply(detect)
.
This is for my toy example with just 100 rows; with 5K rows it is more than 99%.
So the bulk of the time goes into language detection. Details of the algorithm used by detect
can be found here. It turns out that about 40 to 50 characters of text are enough to determine the language, so if your reviews are longer you can save some time by applying detect
not to the whole text but only to its first 50 characters. Depending on the average length of your reviews, this gives a speed-up of a few per cent.
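A minimal sketch of that truncation idea; the 50-character cutoff is the heuristic mentioned above, and the exception guard is an extra precaution in case a review ends up empty after cleaning (langdetect raises LangDetectException on such input):

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_prefix(text):
    # detect the language from the first 50 characters only
    try:
        return detect(text[:50])
    except LangDetectException:
        # empty or degenerate text after cleaning
        return 'unknown'

df['language'] = df['clean_reviews'].apply(detect_prefix)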
Since there is not much to optimize in the detect
function itself, the only way forward is to replace it with something faster, e.g. Google's Compact Language Detector CLD2 or CLD3. I went for the latter, and it turned out to be about 100 times faster than detect
. Another fast alternative is langid
, whose speed is compared to CLD2 in this paper.
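For illustration, a minimal sketch of swapping detect for CLD3; it assumes the pycld3 binding, whose cld3.get_language() returns a prediction carrying language and is_reliable fields (langid.classify() would be a similar drop-in if you go the langid route):

import cld3  # pip install pycld3 (one Python binding for Google's CLD3)

def detect_cld3(text):
    # CLD3 returns a prediction object; guard against degenerate input
    pred = cld3.get_language(text)
    return pred.language if pred is not None and pred.is_reliable else 'unknown'

df['language'] = df['clean_reviews'].apply(detect_cld3)
df_range = df[df['language'] == 'en']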