Question

Is there any function in pandas or sklearn, like graphlab-create's "graphlab.text_analytics.count_words", that counts the words in each row and creates a new word-count column in a CSV data table?

Answer 1

You certainly can. The simplest solution is to use Counter:

from collections import Counter

import pandas as pd

data = {
    "Sentence" : ["Hello World", "The world is mine", "World is big", "Hello you", "foo_bar bar", "temp"],
    "Foo" : ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)  # create a fake dataframe

# Create a single counter shared by all rows
counter = Counter()

# Update the counter with the words of every row of your dataframe
df["Sentence"].str.split(" ").apply(counter.update)

# counter.most_common() returns a list of (word, count) pairs; if you want a dataframe you can do
pd.DataFrame(counter.most_common(), columns=["Word", "freq"])

Note that you may have to preprocess the text beforehand (convert to lowercase, use a stemmer, ...). For example, for my test dataframe, you get:

{'Hello': 2, 'The': 1, 'World': 2, 'bar': 1, 'big': 1, 'foo_bar': 1, 'is': 2, 'mine': 1, 'temp': 1, 'world': 1, 'you': 1}

You can see that you get "World" = 2 and "world" = 1, because I did not convert the text to lowercase.
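To get the per-row counts the question asks about, one option is to apply Counter to each sentence after lowercasing it. This is only a minimal sketch, assuming that splitting on whitespace is good enough as tokenization; the helper name count_words and the column name word_count are made up for illustration and are not part of pandas:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Sentence": ["Hello World", "The world is mine", "World is big"],
})


def count_words(text):
    """Return a {word: count} dict for one sentence (lowercased, split on whitespace)."""
    return dict(Counter(text.lower().split()))


# New column holding one dict of word counts per row (hypothetical column name)
df["word_count"] = df["Sentence"].apply(count_words)

# Corpus-wide totals: "World" and "world" are now counted together
total = Counter()
for row_counts in df["word_count"]:
    total.update(row_counts)
```

If the data came from a CSV file, the extended frame can be written back with df.to_csv; the dict column is then stored as its string representation.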

You can also look at other solutions, such as CountVectorizer (link) or the TF-IDF Vectorizer (link).

I hope this helps,

Nicolas

Word count in graphlab vs. sklearn
