Question

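I need some help with my code. I need to remove punctuation from text in a function and then apply that function to a column of a dataframe. I then need to count the frequency of each word in the resulting strings, which I call review_without_punctuation, and store each word's count in a column as a dictionary. I tried a function that counts words and applied it to review_without_punctuation, but the function fails to run.

Here is my attempt.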

def remove_punctuation(text):

    import string
    from string import maketrans
    ##Multiply by number of punctuation characters
    table = string.maketrans('.?,!:;_', 7 * " ")
    ##takes care of float has no attribute translate
    products['review'] = products.fillna({'review':''})
    return text.translate(table)
review_without_punctuation = products['review'].apply(remove_punctuation)
##products['word_count'] = graphlab.text_analytics.count_words(review_without_punctuation)

products['word_count']= review_without_punctuation.str.split().str.len()

Thanks in advance.

Answer 1

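After running your code, it looks like you are able to remove the punctuation. I'm not familiar with graphlab, but the collections library provides good tools for counting.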
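As a quick illustration of that counting tool, here is a minimal sketch of collections.Counter on a made-up sentence (the sample text is only an assumption for demonstration, not your data):

```python
from collections import Counter

# Split an illustrative sentence on whitespace and tally each word.
words = "the cat saw the other cat".split()
counts = Counter(words)

print(counts["the"])   # 2
print(dict(counts))    # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}
```

A Counter behaves like a dictionary, so wrapping it in dict() gives a plain word-to-count mapping that can be stored in a dataframe cell.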

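I have changed your code to use the collections.Counter data type to create a dictionary of word counts for each row in the series. Note that I moved the imports to the beginning of the code (generally good practice). I also included a test pandas.DataFrame object, which is nice to have so that people can use it to test your code and verify the results.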

from collections import Counter
import pandas as pd

def remove_punctuation(text):
    word_counter = Counter() # Initialize our counter
    # Map each of the 7 punctuation characters to a space.
    # str.maketrans is the Python 3 spelling; Python 2 used string.maketrans.
    table = str.maketrans('.?,!:;_', 7 * " ")
    for word in text.translate(table).split():
        word_counter[word] += 1
    return dict(word_counter)

products = pd.DataFrame({'review':['apple,orange','hello:goodbye']}) # test df
# Fill missing reviews once, before apply, so .translate never sees a float NaN
products['review'] = products['review'].fillna('')
review_without_punctuation = products['review'].apply(remove_punctuation)
products['word_count']= review_without_punctuation

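I modified the code in Python 3.x. I believe string.maketrans is from 2.x (in 3.x the equivalent is str.maketrans), so you may need to adjust that call for your interpreter (I don't have a 2.x environment set up on my machine). My output is as follows: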

In [1]: products
Out[1]: 
          review                  word_count
0   apple,orange   {'apple': 1, 'orange': 1}
1  hello:goodbye  {'hello': 1, 'goodbye': 1}

Does this give you the results you need on your original dataset?
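If it helps, here is a more compact sketch of the same idea, assuming Python 3. The helper name count_words and the sample reviews are made up for illustration. It replaces every character in string.punctuation rather than only the seven listed earlier, and it treats a missing review as empty instead of relying on a separate fill step:

```python
from collections import Counter
import string

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def count_words(text):
    """Return a {word: count} dict for one review (hypothetical helper)."""
    # Missing values (None or a float NaN) are not strings; treat them as empty.
    if not isinstance(text, str):
        return {}
    return dict(Counter(text.translate(PUNCT_TABLE).split()))

# Illustrative reviews, including a missing one.
reviews = ["apple, orange; apple!", None, "hello:goodbye"]
word_counts = [count_words(review) for review in reviews]
print(word_counts)
# [{'apple': 2, 'orange': 1}, {}, {'hello': 1, 'goodbye': 1}]
```

On a dataframe like yours, the same helper should work with products['review'].apply(count_words), since apply simply calls it on each cell.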

Remove punctuation from text, then store the word counts as a dictionary in a dataframe column
