Fastest way to get word counts from text using search terms stored in lists?

Date: 2018-09-20 10:45:56

Tags: python python-3.x pandas nltk

I'm actually trying to build a simple classifier, so I could use an NLTK solution, but my first few attempts have been with Pandas.

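I have a few lists of search terms. I want to check the text against them, get the word counts, and return the results in order: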

import pandas as pd
import re
fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                                "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                                "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                                "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                                "Friday: Long yellow fruit day, peel it and it's ready to go."]
df = pd.DataFrame(fruit_sentences, columns = ['text'])
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

print(df['text'].str.count(r'[XYZ_word in word list]'))

This is where the code blows up, because str.count() does not accept a list.

The end goal is to get back a list of tuples like this:

fruits = [('banana', 5), ('pear', 6), ('apple', 6)]

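Yes, I could loop over all the lists to do this, but I suspect the problem is that I don't know enough Python, rather than that Python has no elegant way to handle it.

I found this question (here), but it seems that everyone either answered it incorrectly or offered a solution different from what was actually asked.

Thanks for helping a newbie figure this out!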

4 answers:

Answer 0: (score: 1)

I think you need:

#create dict for names of lists
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}
#join all rows into one big string
L =  ' '.join(df['text'])

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 4), ('apple', 6), ('pear', 6)]

If you want the match to ignore case, lowercase the text first:

#join all rows into one big lowercase string
L =  ' '.join(df['text']).lower()

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 6), ('apple', 6), ('pear', 6)]
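
One caveat worth illustrating: str.count on a plain string counts substrings, so a term like 'red' would also be counted inside a longer word (for example "bored", which does not occur in this data). The following is a minimal sketch, not part of the original answer, that counts whole words only and ignores case. The helper name count_terms is hypothetical, and it assumes the terms within one list do not overlap each other.

```python
import re

fruit_sentences = [
    "Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]
keywords = {
    'banana': ['yellow', 'long', 'peel'],
    'apple': ['round', 'red', 'green leaf'],
    'pear': ['stout', 'sweet', 'green'],
}

text = ' '.join(fruit_sentences)

def count_terms(text, terms):
    # Hypothetical helper: build one alternation pattern for the whole list.
    # re.escape protects against regex metacharacters in a term;
    # \b anchors every alternative at word boundaries.
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b',
        re.IGNORECASE,
    )
    return len(pattern.findall(text))

counts = [(name, count_terms(text, terms)) for name, terms in keywords.items()]
print(counts)  # [('banana', 6), ('apple', 6), ('pear', 6)]
```

Because findall returns non-overlapping matches, two terms in the same list that overlap in the text (say 'green' and 'green leaf') would be counted once per position, not once per term.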

Answer 1: (score: 1)

Use str.contains with a regular expression:

# store lists in a dictionary for checking values.
a = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

d = {}
# regular expression to match words
regex = r'(?<!\S){0}[^\w\s]?(?!\S)'

for i, j in a.items():
    d[i] = sum([df['text'].str.contains(regex.format(k), case=False).sum() for k in j])

print(list(d.items()))

Output:

[('banana', 6), ('apple', 6), ('pear', 6)]
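
Note that str.contains returns one boolean per row, so summing it counts the rows that contain a term, not the number of occurrences. The two happen to agree on the data above because no sentence repeats a keyword. The sketch below uses a hypothetical two-row sample (not from the question) to show the difference, and also shows that str.count accepts a list once the terms are joined into a single alternation pattern.

```python
import re
import pandas as pd

# Hypothetical sample: the first row repeats the keyword 'red'.
sample = pd.Series(["Red apple, red leaf.", "A green pear."])
terms = ['red', 'green leaf']

# str.count takes a regex, not a list, so join the escaped terms with '|'.
pattern = '|'.join(re.escape(t) for t in terms)

# Number of rows that contain at least one term.
rows_matching = int(sample.str.contains(pattern, case=False).sum())

# Total number of occurrences across all rows.
occurrences = int(sample.str.count(pattern, flags=re.IGNORECASE).sum())

print(rows_matching, occurrences)  # 1 2
```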

Answer 2: (score: 1)

How about:

python 3.6.4 / pandas 0.23.4:

import pandas as pd

def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()

fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)
res = pd.DataFrame(columns=[])
res['type'] = pd.Series(list(keywords.keys()))
res['value'] = pd.Series(list(keywords.values())).apply(lambda x: count(x)).sum(axis=1)
print(list(res.itertuples(index=False, name=None)))

python 2.7.11 / pandas 0.17:

import pandas as pd


def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()


fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)

res = pd.DataFrame(columns=[])
res['type'] = pd.Series(keywords.keys())
res['value'] = pd.Series(keywords.values()).apply(lambda x: count(x)).sum(axis=1)

print(list(res.itertuples(index=False)))

Both will give you:

[('banana', 4), ('apple', 6), ('pear', 6)]

Answer 3: (score: 1)

For this I would use dictionary lookups (very fast) and build the dictionary with Counter, which is O(n).

# create a dict of look up values
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

# preprocess data
df['text'] = df['text'].str.lower()
df['text'] = [re.sub(r'[^a-zA-Z0-9\s]','',x) for x in df['text']]
df['text'] = df.text.str.split()

# flatten the list and create a dict
from collections import Counter 

my_list = [i for s in df['text'] for i in s]
word_count = Counter(my_list)

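# final job: for each fruit, count how many of its keywords occur at least once
# (multi-word entries such as 'green leaf' never match the single-word tokens)
output_dict = {k:len([x for x in v if x in word_count]) for k,v in d.items()}
print(sorted(output_dict.items(), key=lambda x: x[1]))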

[('apple', 2), ('banana', 3), ('pear', 3)]
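
The result above reports how many distinct keywords from each list appear in the text, not how often they appear. If total occurrences are wanted, as in the question, one possible variation is to sum the Counter values and also count adjacent word pairs so that a two-word term like 'green leaf' can be looked up. This is a sketch under those assumptions, not the original answer's method; the names totals and result are illustrative.

```python
import re
from collections import Counter

fruit_sentences = [
    "Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]
keywords = {
    'banana': ['yellow', 'long', 'peel'],
    'apple': ['round', 'red', 'green leaf'],
    'pear': ['stout', 'sweet', 'green'],
}

counts = Counter()
for sentence in fruit_sentences:
    # Lowercase, strip punctuation, and split into word tokens.
    tokens = re.sub(r'[^a-z0-9\s]', '', sentence.lower()).split()
    counts.update(tokens)
    # Also count adjacent word pairs so two-word terms can be looked up.
    counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))

# A missing key in a Counter yields 0, so absent terms add nothing.
totals = {name: sum(counts[t] for t in terms) for name, terms in keywords.items()}
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('banana', 6), ('apple', 6), ('pear', 6)]
```

Building the Counter is a single pass over the tokens, and each keyword lookup is then a constant-time dictionary access.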