Question

I'm actually trying to build a simple classifier, so I could use an NLTK solution, but my first few attempts used Pandas.

I have several lists, and I want to check the text against them, get word counts, and return the results in order.

import pandas as pd
import re
fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                                "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                                "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                                "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                                "Friday: Long yellow fruit day, peel it and it's ready to go."]
df = pd.DataFrame(fruit_sentences, columns = ['text'])
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

print(df['text'].str.count(r'[XYZ_word in word list]'))

This is where the code blows up, because str.count() does not accept a list.

The end goal is to get back a list of tuples like this:

fruits = [('banana', 5), ('pear', 6), ('apple', 6)]

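Yes, I could loop over all the lists to do this, but it seems more likely that I just don't know enough Python than that Python has no elegant way to handle this.

I found a related question (here), but it seems everyone either answered it incorrectly or used a solution different from what was actually asked.

Thanks for helping a newbie figure this out!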

Answer 1

I think you need:

#create dict for names of lists
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}
#join all sentences into one big string
L =  ' '.join(df['text'])

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 4), ('apple', 6), ('pear', 6)]

To ignore case, lowercase the text first:

#join all sentences into one big lowercase string
L =  ' '.join(df['text']).lower()

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 6), ('apple', 6), ('pear', 6)]
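
One caveat with L.count(x) is that it counts substrings, so a term such as 'red' is also counted inside a longer word like 'hundred'. Below is a minimal sketch of a whole-word variant, assuming regex word boundaries (\b) are an acceptable definition of a word. The helper name count_terms and the sample text are illustrative and not part of the answer above.

```python
import re

def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in text."""
    total = 0
    for term in terms:
        # re.escape guards against regex metacharacters in a search term
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        total += len(pattern.findall(text))
    return total

# made-up sample text that contains 'red' inside 'hundred'
text = "A red apple. One hundred reasons to eat a Red, round fruit."
apple_words = ['round', 'red', 'green leaf']

# plain substring counting, as in the answer above
substring_count = sum(text.lower().count(w) for w in apple_words)
# whole-word counting with the hypothetical helper
whole_word_count = count_terms(text, apple_words)
print(substring_count, whole_word_count)
```

Here the substring count is 4 and the whole-word count is 3, because 'hundred' no longer contributes a match for 'red'.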

Answer 2

Use str.contains with a regular expression.

# store lists in a dictionary for checking values.
a = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

d = {}
# regular expression to match whole words (raw string, so \S and \w are not treated as string escapes)
regex = r'(?<!\S){0}[^\w\s]?(?!\S)'

for i, j in a.items():
    d[i] = sum([df['text'].str.contains(regex.format(k), case=False).sum() for k in j])

print (list(d.items()))

Output:

[('banana', 6), ('apple', 6), ('pear', 6)]
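
Note that str.contains yields one boolean per row, so the sums above count the sentences that contain each term rather than the total number of occurrences. For the sample data the two coincide, because no term repeats within a sentence. Below is a plain-Python sketch of the difference, using the same pattern idea. The helper names make_pattern, rows_containing and occurrences, and the sample sentences, are made up for illustration.

```python
import re

def make_pattern(term):
    # Same idea as the regex above: the term delimited by whitespace or
    # string edges, optionally followed by one punctuation character.
    return re.compile(r'(?<!\S)' + re.escape(term) + r'[^\w\s]?(?!\S)',
                      re.IGNORECASE)

def rows_containing(sentences, term):
    """Number of sentences with at least one match (what str.contains sums to)."""
    pattern = make_pattern(term)
    return sum(1 for s in sentences if pattern.search(s))

def occurrences(sentences, term):
    """Total number of matches across all sentences."""
    pattern = make_pattern(term)
    return sum(len(pattern.findall(s)) for s in sentences)

# made-up sample where 'red' repeats within one sentence
sentences = ["Red skin, red flesh, red juice.", "A green leaf on a red fruit."]
print(rows_containing(sentences, 'red'), occurrences(sentences, 'red'))
```

With pandas, the per-row counterpart of occurrences would likely be df['text'].str.count(...), which takes a regular expression and a flags argument, followed by .sum().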

Answer 3

How about:

Python 3.6.4 / pandas 0.23.4:

import pandas as pd

def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()

fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)
res = pd.DataFrame(columns=[])
res['type'] = pd.Series(list(keywords.keys()))
res['value'] = pd.Series(list(keywords.values())).apply(lambda x: count(x)).sum(axis=1)
print(list(res.itertuples(index=False, name=None)))

Python 2.7.11 / pandas 0.17:

import pandas as pd


def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()


fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)

res = pd.DataFrame(columns=[])
res['type'] = pd.Series(keywords.keys())
res['value'] = pd.Series(keywords.values()).apply(lambda x: count(x)).sum(axis=1)

print(list(res.itertuples(index=False)))

Both give you:

[('banana', 4), ('apple', 6), ('pear', 6)]

Answer 4

For this I would use dictionary lookups (very fast) and build the dictionary with Counter, which is O(n).

# create a dict of look up values
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

# preprocess data
df['text'] = df['text'].str.lower()
df['text'] = [re.sub(r'[^a-zA-Z0-9\s]','',x) for x in df['text']]
df['text'] = df.text.str.split()

# flatten the list and create a dict
from collections import Counter 

my_list = [i for s in df['text'] for i in s]
word_count = Counter(my_list)

# final job: for each list, count how many of its search terms occur at least once
output_dict = {k:len([x for x in v if x in word_count]) for k,v in d.items()}
sorted(output_dict.items(), key=lambda x: x[1])

[('apple', 2), ('banana', 3), ('pear', 3)]
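
The comprehension above reports how many search terms from each list appear at least once, which is why the numbers are lower than in the other answers. If total occurrences are wanted, the counts stored in the Counter can be summed instead. Below is a minimal sketch under that assumption, with a shortened, made-up sample. Note that a multi-word term such as 'green leaf' is never found this way, since the text is split into single tokens.

```python
import re
from collections import Counter

# shortened, made-up sample sentences
sentences = ["Yellow makes me happy, so I eat a long, sweet fruit with a peel.",
             "Long yellow fruit day, peel it and it's ready to go.",
             "The red round fruit has a green leaf."]
d = {'banana': ['yellow', 'long', 'peel'],
     'apple': ['round', 'red', 'green leaf'],
     'pear': ['stout', 'sweet', 'green']}

# lowercase, strip punctuation, split into tokens, then count in a single pass
tokens = re.sub(r'[^a-z0-9\s]', '', ' '.join(sentences).lower()).split()
word_count = Counter(tokens)

# a Counter returns 0 for missing keys, so no membership test is needed
totals = {k: sum(word_count[x] for x in v) for k, v in d.items()}

# highest count first
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

For this sample, banana scores 6 while apple and pear score 2 each; 'green leaf' contributes nothing to apple because it spans two tokens.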

Fastest way to get word counts from text using search terms from lists?
