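I am actually trying to build a simple classifier, so I could use an NLTK solution, but my first few attempts used Pandas.
I have a few word lists that I want to check against some text to get word counts, and then return the results in order.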
import pandas as pd
import re
fruit_sentences = ["Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
"Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
"Wednesday: The stout, sweet green fruit keeps me on my toes!",
"Thursday: Another day with the red round fruit. I like to keep the green leaf.",
"Friday: Long yellow fruit day, peel it and it's ready to go."]
df = pd.DataFrame(fruit_sentences, columns = ['text'])
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']
print(df['text'].str.count(r'[XYZ_word in word list]'))
This is where the code blows up, because str.count() does not accept a list.
The end goal is to get back a list of tuples like this:
fruits = [('banana', 5), ('pear', 6), ('apple', 6)]
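Yes, I could loop over all the lists to do this, but it feels like I just don't know enough Python, rather than Python not knowing how to handle this elegantly.
I found this question (here), but it seems everyone either answered it incorrectly or used a solution different from what was actually asked.
Thanks for helping this newbie figure it out!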
Answer 0 (score: 1)
I think you need:
# map each fruit name to its keyword list
d = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}
# join all sentences into one big string
L = ' '.join(df['text'])
# count each keyword in the string (case-sensitive substring count) and sum per fruit
out = [(k, sum(L.count(x) for x in v)) for k, v in d.items()]
print (out)
[('banana', 4), ('apple', 6), ('pear', 6)]
If you want to match against lowercased values:
# join all sentences into one big string and lowercase it
L = ' '.join(df['text']).lower()
# count each keyword in the string and sum per fruit
out = [(k, sum(L.count(x) for x in v)) for k, v in d.items()]
print (out)
[('banana', 6), ('apple', 6), ('pear', 6)]
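One caveat with counting on the joined string is that `str.count` matches substrings, so a keyword such as `'red'` would also be counted inside a longer word such as "hundred". The following is only a sketch, assuming whole-word matching is what is wanted; it uses `\b` word boundaries from the standard `re` module. The helper name `count_whole_words` and the `keywords` dictionary are made up for illustration and are not part of the answer above.

```python
import re

fruit_sentences = [
    "Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit. I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]

# hypothetical name for the fruit -> keyword-list mapping
keywords = {
    'banana': ['yellow', 'long', 'peel'],
    'apple': ['round', 'red', 'green leaf'],
    'pear': ['stout', 'sweet', 'green'],
}


def count_whole_words(text, words):
    """Count case-insensitive, whole-word occurrences of each word or phrase."""
    total = 0
    for word in words:
        # re.escape guards against keywords containing regex metacharacters
        pattern = r'\b' + re.escape(word) + r'\b'
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


text = ' '.join(fruit_sentences)
out = [(name, count_whole_words(text, words)) for name, words in keywords.items()]
print(out)
```

On the sample sentences this gives the same totals as the lowercase version above; the difference only shows up when a keyword is embedded in a longer word.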
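Answer 1 (score: 1)
Use str.contains with a regular expression.
# store the lists in a dictionary so they can be looked up by fruit name
a = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}
d = {}
# match the keyword as a whole token, optionally followed by one punctuation mark
regex = r'(?<!\S){0}[^\w\s]?(?!\S)'
for i, j in a.items():
    # str.contains gives one boolean per row, so this sums the rows containing each keyword
    d[i] = sum(df['text'].str.contains(regex.format(k), case=False).sum() for k in j)
print(list(d.items()))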
[('banana', 6), ('apple', 6), ('pear', 6)]
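Note that `str.contains` yields one boolean per row, so the loop above counts the rows in which each keyword appears rather than the total number of occurrences. On this data the two happen to agree, because no keyword occurs twice in the same sentence. If per-occurrence counts are wanted, one possible sketch is to join each list into a single alternation pattern and pass it to `Series.str.count`. The helper name `keyword_pattern` and the `keywords` dictionary are assumptions introduced for this example.

```python
import re

import pandas as pd

fruit_sentences = [
    "Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit. I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]
df = pd.DataFrame(fruit_sentences, columns=['text'])

# hypothetical name for the fruit -> keyword-list mapping
keywords = {
    'banana': ['yellow', 'long', 'peel'],
    'apple': ['round', 'red', 'green leaf'],
    'pear': ['stout', 'sweet', 'green'],
}


def keyword_pattern(words):
    """Build one whole-word alternation regex from a list of keywords."""
    # longest alternatives first, so a phrase is tried before any shorter prefix
    alternatives = sorted(words, key=len, reverse=True)
    return r'\b(?:' + '|'.join(re.escape(w) for w in alternatives) + r')\b'


out = [
    # str.count returns matches per row; summing gives the total for the column
    (name, int(df['text'].str.count(keyword_pattern(words), flags=re.IGNORECASE).sum()))
    for name, words in keywords.items()
]
print(out)
```

Because regex matches do not overlap, two keywords in the same list that share text (for example `'green'` and `'green leaf'` together) would be counted as a single match; that does not occur with the lists in the question.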
Answer 2 (score: 1)
How about:
python 3.6.4 / pandas 0.23.4:
import pandas as pd
def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()
fruit_sentences = ["Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
"Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
"Wednesday: The stout, sweet green fruit keeps me on my toes!",
"Thursday: Another day with the red round fruit. I like to keep the green leaf.",
"Friday: Long yellow fruit day, peel it and it's ready to go."]
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']
keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}
s = pd.Series(fruit_sentences)
res = pd.DataFrame(columns=[])
res['type'] = pd.Series(list(keywords.keys()))
res['value'] = pd.Series(list(keywords.values())).apply(lambda x: count(x)).sum(axis=1)
print(list(res.itertuples(index=False, name=None)))
python 2.7.11 / pandas 0.17:
import pandas as pd
def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()
fruit_sentences = ["Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
"Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
"Wednesday: The stout, sweet green fruit keeps me on my toes!",
"Thursday: Another day with the red round fruit. I like to keep the green leaf.",
"Friday: Long yellow fruit day, peel it and it's ready to go."]
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']
keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}
s = pd.Series(fruit_sentences)
res = pd.DataFrame(columns=[])
res['type'] = pd.Series(keywords.keys())
res['value'] = pd.Series(keywords.values()).apply(lambda x: count(x)).sum(axis=1)
print(list(res.itertuples(index=False)))
Both give you:
[('banana', 4), ('apple', 6), ('pear', 6)]
Answer 3 (score: 1)
For this I would use dictionary lookups (super fast) and build the dictionary with Counter, which is O(n).
# create a dict of look up values
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}
# preprocess data
df['text'] = df['text'].str.lower()
df['text'] = [re.sub(r'[^a-zA-Z0-9\s]','',x) for x in df['text']]
df['text'] = df.text.str.split()
# flatten the list and create a dict
from collections import Counter
my_list = [i for s in df['text'] for i in s]
word_count = Counter(my_list)
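# final job: for each fruit, count how many of its keywords appear at least once
output_dict = {k:len([x for x in v if x in word_count]) for k,v in d.items()}
# sort ascending by that count
sorted(output_dict.items(), key=lambda x: x[1])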
[('apple', 2), ('banana', 3), ('pear', 3)]
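The result above counts how many distinct keywords from each list occur at least once, and the two-word phrase 'green leaf' can never equal a single token, which is why apple scores 2. If total occurrences are wanted instead, as in the question, a possible variant is to sum the Counter values and to also count adjacent word pairs so that two-word phrases can be looked up. This is only a sketch using the standard library; the names `keywords`, `token_lists`, `totals` and `ranked` are introduced here for illustration.

```python
import re
from collections import Counter

fruit_sentences = [
    "Monday: Yellow makes me happy. So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit. I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]

# hypothetical name for the fruit -> keyword-list mapping
keywords = {
    'banana': ['yellow', 'long', 'peel'],
    'apple': ['round', 'red', 'green leaf'],
    'pear': ['stout', 'sweet', 'green'],
}

# lowercase, strip punctuation and split each sentence into tokens
token_lists = [re.sub(r'[^a-z0-9\s]', '', s.lower()).split() for s in fruit_sentences]

counts = Counter()
for tokens in token_lists:
    counts.update(tokens)  # single words
    # adjacent pairs within a sentence, so 'green leaf' becomes a countable key
    counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))

# a Counter returns 0 for missing keys, so absent keywords add nothing
totals = {name: sum(counts[w] for w in words) for name, words in keywords.items()}

# highest count first; ties keep the dictionary's insertion order
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Building the Counter is still a single pass over the tokens, and each keyword lookup stays a constant-time dictionary access, so the approach of the answer is preserved.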