How can I optimize my pandas dataframe preprocessing?

Time: 2019-08-02 20:48:20

Tags: python pandas data-cleaning

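I have a pandas dataframe with several hundred thousand rows and a column df['reviews'] that contains text reviews of products. I am cleaning the data, but the preprocessing takes a very long time. Could you suggest how to optimize my code? Thanks in advance.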

# import useful libraries
import pandas as pd
from langdetect import detect
import nltk
from html2text import unescape
from nltk.corpus import stopwords

# define corpus
words = set(nltk.corpus.words.words())

# define stopwords
stop = stopwords.words('english')
newStopWords = ['oz','stopWord2']
stop.extend(newStopWords)

# read csv into dataframe
df=pd.read_csv('./data.csv')

# unescape reviews (fix html encoding)
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)

# remove non-ASCII characters
df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

# calculate number of stop words in raw reviews
df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))

# lowercase reviews
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# add a space before and after every punctuation 
df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')

# remove punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')

# remove stopwords
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

# remove digits
df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')

# remove non-corpus words
def remove_noncorpus(sentence):
    print(sentence)
    return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())

df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)

# count number of characters
df['character_count'] = df['clean_reviews'].apply(len)

# count number of words
df['word_count'] = df['clean_reviews'].str.split().str.len()

# average word length
def avg_word(sentence):
  words = sentence.split()
  print(sentence)
  return (sum(len(word) for word in words)/len(words))

df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
df[['clean_reviews','avg_word']].head()

# detect language of reviews
df['language'] = df['clean_reviews'].apply(detect)

# filter out non-English reviews
msk = (df['language'] == 'en')
df_range = df[msk]

# write dataframe to csv
df_range.to_csv('dataclean.csv', index=False)

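The code posted above does everything I need; however, it takes hours to complete. I would appreciate any helpful suggestions for reducing the processing time. Let me know if you need any other details.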

1 Answer:

Answer 0 (score: 3)

1) How to find the most time-consuming parts of your program

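First, you have to find out where your program spends most of its time. As mentioned in the comments above, this can be done "manually" by inserting a print() after each step to get a visual impression of the program's progress. To get quantitative results, you can wrap each step in start = time.time() and print('myProgramStep: {}'.format(time.time() - start)) calls. That is fine as long as your program is relatively short; otherwise it becomes tedious.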
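
The manual timing described above can be wrapped in a small helper so that each step needs only one extra line. This is a minimal sketch, not part of the original answer: the helper name timed, the timings dictionary and the toy reviews list are all made up for illustration, and time.perf_counter() is used in place of time.time() because it is intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print('{}: {:.4f} s'.format(label, elapsed))


# Toy stand-in for the review column (assumption: a plain list of strings)
reviews = ["Great product, 5 stars!", "Not worth the money."] * 1000
timings = {}

with timed('lowercase', timings):
    lowered = [r.lower() for r in reviews]

with timed('split', timings):
    tokens = [r.split() for r in lowered]
```

In the real script, each df[...] = ... statement would go inside its own with timed(...) block.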

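The best way is to use a profiler. Python comes with a built-in profiler, but it is a bit cumbersome to use. First we profile the program with cProfile, then we load the profile with pstats for review: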

python3 -m cProfile -o so57333255.py.prof so57333255.py
python3 -m pstats  so57333255.py.prof

Inside pstats, we enter sort cumtime to sort by the time spent in a function plus all the functions it calls, and stats 5 to show the top 5 entries:

         2351652 function calls (2335973 primitive calls) in 9.843 seconds

   Ordered by: cumulative time
   List reduced from 4964 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1373/1    0.145    0.000    9.852    9.852 {built-in method exec}
        1    0.079    0.079    9.852    9.852 so57333255.py:2(<module>)
        9    0.003    0.000    5.592    0.621 {pandas._libs.lib.map_infer}
        8    0.001    0.000    5.582    0.698 /usr/local/lib/python3.4/dist-packages/pandas/core/series.py:2230(apply)
      100    0.001    0.000    5.341    0.053 /usr/local/lib/python3.4/dist-packages/langdetect/detector_factory.py:126(detect)

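From this we learn that the most expensive single function in the program is apply, which is called 8 times. What we cannot tell from here is whether those 8 calls each took roughly the same time or whether one of them took especially long. On the next line, however, we see that detect accounts for 5.341 seconds, i.e. most of the 5.582 seconds spent in the 8 apply calls went into apply(detect). You can get further insight with the callers and callees commands, but as you can see this is not very convenient.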
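
The same cProfile and pstats steps can also be driven from Python instead of the interactive prompt. The following is a sketch under assumptions: slow_step, fast_step and pipeline are toy functions invented here to stand in for the preprocessing script, and only standard-library calls are used.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Toy stand-in for an expensive step such as apply(detect)."""
    return sum(i * i for i in range(n))


def fast_step(n):
    """Toy stand-in for a cheap step."""
    return n * 2


def pipeline():
    fast_step(10)
    return slow_step(200000)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)  # like "sort cumtime" + "stats 5"
stats.print_callers('slow_step')               # like the "callers" command
report = buffer.getvalue()
print(report)
```

Writing the report into a StringIO buffer makes it easy to save or search the output; passing no stream prints straight to stdout.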

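A more user-friendly approach is the line profiler. It uses a @profile decorator to profile calls to a function, so we have to put the whole program into a function carrying that decorator and then call that function. We then get the following result: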

Total time: 8.59578 s
File: so57333255a.py
Function: runit at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def runit():
    10                                           
    11                                               # define corpus
    12         1     385710.0 385710.0      4.5      words = set(nltk.corpus.words.words())
    13                                           
    14                                               # define stopwords
    15         1       2068.0   2068.0      0.0      stop = stopwords.words('english')
    16         1         10.0     10.0      0.0      newStopWords = ['oz','stopWord2']
    17         1          9.0      9.0      0.0      stop.extend(newStopWords)
    18                                           
    19                                               # read csv into dataframe
    20         1      46880.0  46880.0      0.5      df=pd.read_csv('reviews.csv', names=['reviews'], header=None, nrows=100)
    21                                           
    22                                               # unescape reviews (fix html encoding)
    23         1      16922.0  16922.0      0.2      df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
    24                                           
    25                                               # remove non-ASCII characters
    26         1      15133.0  15133.0      0.2      df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
    27                                           
    28                                               # calculate number of stop words in raw reviews
    29         1      20721.0  20721.0      0.2      df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
    30                                           
    31                                               # lowercase reviews
    32         1       5325.0   5325.0      0.1      df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
    33                                           
    34                                               # add a space before and after every punctuation 
    35         1       9834.0   9834.0      0.1      df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
    36                                           
    37                                               # remove punctuation
    38         1       3262.0   3262.0      0.0      df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
    39                                           
    40                                               # remove stopwords
    41         1      20259.0  20259.0      0.2      df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    42                                           
    43                                               # remove digits
    44         1       2897.0   2897.0      0.0      df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
    45                                           
    46                                               # remove non-corpus words
    47         1          9.0      9.0      0.0      def remove_noncorpus(sentence):
    48                                                   #print(sentence)
    49                                                   return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
    50                                           
    51         1       6698.0   6698.0      0.1      df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
    52                                           
    53                                               # count number of characters
    54         1       1912.0   1912.0      0.0      df['character_count'] = df['clean_reviews'].apply(len)
    55                                           
    56                                               # count number of words
    57         1       3641.0   3641.0      0.0      df['word_count'] = df['clean_reviews'].str.split().str.len()
    58                                           
    59                                               # average word length
    60         1          9.0      9.0      0.0      def avg_word(sentence):
    61                                                 words = sentence.split()
    62                                                 #print(sentence)
    63                                                 return (sum(len(word) for word in words)/len(words)) if len(words)>0 else 0
    64                                           
    65         1       3445.0   3445.0      0.0      df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
    66         1       3786.0   3786.0      0.0      df[['clean_reviews','avg_word']].head()
    67                                           
    68                                               # detect language of reviews
    69         1    8037362.0 8037362.0     93.5      df['language'] = df['clean_reviews'].apply(detect)
    70                                           
    71                                               # filter out non-English reviews
    72         1       1453.0   1453.0      0.0      msk = (df['language'] == 'en')
    73         1       2353.0   2353.0      0.0      df_range = df[msk]
    74                                           
    75                                               # write dataframe to csv
    76         1       6087.0   6087.0      0.1      df_range.to_csv('dataclean.csv', index=False) 

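From this we can see directly that 93.5% of the total time is spent on df['language'] = df['clean_reviews'].apply(detect).
That is for my toy example with only 100 rows; for 5K rows it would be more than 99%.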
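
For reference, the wrapping needed for the line profiler can be sketched as follows. The body of runit here is a toy stand-in rather than the original script, and the no-op fallback for profile is an addition so that the file also runs as a normal Python script; when it is started with kernprof -l -v, kernprof supplies profile itself.

```python
# Run with:  kernprof -l -v this_script.py   (kernprof ships with line_profiler)
try:
    profile  # kernprof injects this name as a builtin when it runs the script
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func


@profile
def runit():
    # Toy stand-in for the preprocessing steps (assumption: list of strings)
    reviews = ["Great product!", "Terrible, broke after 2 days."] * 500
    lowered = [r.lower() for r in reviews]
    no_digits = ["".join(c for c in r if not c.isdigit()) for r in lowered]
    return no_digits


result = runit()
```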

2) How to make it faster

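So most of the time is spent on language detection. Details of the algorithm used by detect can be found here. It turns out that roughly 40 to 50 characters of text are enough to determine the language, so if your reviews are much longer than that, you can save some time by applying detect not to the whole text but only to the first 50 characters or so. Depending on the average length of your reviews, this could make it a few percent faster.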
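
The truncation idea can be sketched like this. The helper detect_on_prefix and the stub detector are hypothetical names introduced only so the example runs without langdetect installed; in the real pipeline the detector argument would be langdetect's detect.

```python
def detect_on_prefix(texts, detector, n_chars=50):
    """Run the detector on only the first n_chars characters of each text."""
    return [detector(text[:n_chars]) for text in texts]


# Stub detector (assumption: stands in for langdetect.detect) that records
# how much text it was actually given.
seen_lengths = []


def stub_detector(text):
    seen_lengths.append(len(text))
    return 'en'


reviews = ["x" * 200, "short review"]
languages = detect_on_prefix(reviews, stub_detector)
print(languages, seen_lengths)
```

On the dataframe itself, the equivalent would be df['clean_reviews'].str[:50].apply(detect).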

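Since there is not much to optimize in the detect function itself, the only way is to replace it with something faster, e.g. Google's Compact Language Detectors CLD2 or CLD3. I went with the latter, and it turned out to be about 100 times faster than detect. Another fast alternative is langid, whose speed is compared with CLD2 in this paper.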
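
A drop-in replacement might look like the sketch below. It is not the code used for the measurement above, and it rests on assumptions about the bindings: that the cld3 module (for example from the pycld3 package) exposes get_language(text) returning an object with a language attribute, and that langid.classify(text) returns a (language, score) pair. The helper names load_language_classifier and keep_english are made up for illustration.

```python
def load_language_classifier():
    """Return (name, classify) for the first fast detector that is installed.

    classify maps a string to a language code such as 'en'.
    Returns (None, None) if neither library can be imported.
    """
    try:
        import cld3  # assumption: Python bindings for CLD3, e.g. pycld3

        def classify_cld3(text):
            prediction = cld3.get_language(text)
            # Guard against a missing prediction (e.g. for empty input)
            return prediction.language if prediction is not None else None

        return 'cld3', classify_cld3
    except ImportError:
        pass
    try:
        import langid

        def classify_langid(text):
            return langid.classify(text)[0]  # (language, score) -> language

        return 'langid', classify_langid
    except ImportError:
        return None, None


def keep_english(texts, classify):
    """Keep only the texts whose detected language is English."""
    return [text for text in texts if classify(text) == 'en']


backend_name, classify = load_language_classifier()
print('language detector in use:', backend_name)
```

With a classifier loaded, the detection step in the question would become df['language'] = df['clean_reviews'].apply(classify), and the rest of the script stays unchanged.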