Question

Say I have a list of keywords (about 300):

Key Word
abduct
attack
airstrike
bomb

I want to iterate over a column (Text) of a DataFrame (df1) to find every instance where a keyword appears. My end goal is a total count for each keyword.

Text                                Location     Date 
Police have just discovered a bomb. New York    4/30/2015, 23:54:27  
...

I know I can use str.contains (see below) to find the total for each word one at a time, but I'm looking for a simple way to compute all the totals at once.

word_count = df1[df1['Text'].str.contains('Key Word')].count()

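I have also tried to solve this with a script that splits all of the data in "Text" into individual words and sums the totals, but this does not account for any keywords that contain spaces (at least in its current form).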

df.Text.str.lower().apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)

Any help is greatly appreciated!

Answer 1

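It seems you want to split all of the text into a single list of words, then scan that list just once, using a dict to count occurrences. You can start with: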

word_list = (df1.Text + ' ').sum().split()

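This gives a single list of all the words in the column. Appending a space to each entry prevents the last word of one entry from being joined to the first word of the next. Then scan the list, counting the keywords: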

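word_count = dict((keyword, 0) for keyword in keywords)
for word in word_list:
    try:
        # only keys already in the dict (the keywords) are incremented
        word_count[word] += 1
    except KeyError:
        # word is not a keyword, skip it
        pass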

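Dict lookup is O(1) and you only need to scan word_list once, so this is algorithmically reasonable. The only problem I can think of right now is keywords made up of several words. However, you could simply treat the individual words of each key phrase as keywords, count those, and then infer the frequency of the key phrase. This is not perfect, but it will work if the words that make up the key phrases do not overlap, and it may still be workable depending on how much they overlap. I would guess that is good enough, but I can't know without seeing all of the keywords.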
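One practical wrinkle with a plain split() is that punctuation stays attached to words, so "bomb." in the sample row would not match the keyword "bomb". Below is a minimal sketch of the same single-pass count using collections.Counter and a regex tokenizer. The sample frame, the keywords list and the token pattern [a-z']+ are assumptions made for illustration, not part of the original data:

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for df1 and the keyword list.
df1 = pd.DataFrame({"Text": ["Police have just discovered a bomb.",
                             "Bomb squad responds to attack; second attack feared"]})
keywords = ["abduct", "attack", "airstrike", "bomb"]

# Lower-case everything and keep only runs of letters/apostrophes,
# so trailing punctuation such as "bomb." does not block a match.
tokens = re.findall(r"[a-z']+", " ".join(df1["Text"]).lower())

# Count every token in one pass, then look up only the keywords.
all_counts = Counter(tokens)
word_count = {kw: all_counts[kw] for kw in keywords}
print(word_count)
```

A Counter returns 0 for missing keys, so keywords that never occur need no try/except.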

Edit: I thought of a way to do the same thing using only pandas:

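word_series = pd.Series((df1.Text + ' ').sum().split())
# reindex (rather than .loc) gives 0 for keywords that never occur
# instead of raising a KeyError
word_series.value_counts().reindex(key_words, fill_value=0)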

This gives you the number of occurrences of each keyword. It still does not solve the key-phrase problem.

However, here is a solution that works for two-word key phrases:

two_word_series = word_series + ' ' + word_series.shift(-1)
# a series of all consecutive pairs in the word_series
# (the last entry is NaN and is ignored by value_counts)
two_word_series.value_counts().reindex(two_word_key_phrases, fill_value=0)

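This can be generalized to n-word phrases, but it becomes cumbersome after a while. How feasible it is depends on the maximum length of your key phrases.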
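The generalization to n-word phrases can be sketched without chaining shift calls, by zipping the word list against offset copies of itself. The helper name ngram_counts and the sample data are hypothetical, chosen only for illustration:

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in `words`."""
    # zip(words, words[1:], ..., words[n-1:]) yields each window of length n.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Hypothetical word list; in practice this could be the word_list built earlier.
words = "the air strike police called an air strike on the convoy".split()
key_phrases = ["air strike", "air strike police", "car bomb"]

phrase_count = {}
for phrase in key_phrases:
    n = len(phrase.split())  # the phrase length decides which n-grams to build
    phrase_count[phrase] = ngram_counts(words, n)[phrase]
print(phrase_count)
```

This rebuilds the n-gram counts for every phrase, which is fine for a sketch. With around 300 keywords it would probably be worth computing the counts once per distinct phrase length and reusing them.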

Answer 2

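If you want a solution that can also handle specific phrases (ones you know in advance), you can replace the spaces inside each phrase with another character (e.g. "_"). For example: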

import pandas as pd
from collections import Counter

df = pd.DataFrame(['Police have discovered an air bomb', 'Air strike the bomb', 'The air strike police are going on strike', 'Air bomb is full of hot air'], columns = ['text'])
keywords = ['bomb', 'police', 'air strike']
keyword_dict = {w:w.replace(' ', '_') for w in keywords}

corpus = ' '.join(df.text).lower()
for w,w2 in keyword_dict.items():
   corpus = corpus.replace(w,w2)

all_counts = Counter(corpus.split())
final_counts = {w:all_counts[w2] for w,w2 in keyword_dict.items()}
print(final_counts)
{'bomb': 3, 'police': 2, 'air strike': 2}

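A more general solution (and probably better practice from a text-mining point of view, since you do not need to know in advance which phrases you are looking for) is to extract all the bigrams from the text and run the count over the whole thing: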

corpus = ' '.join(df.text).lower()
words = corpus.split()
bigrams = [' '.join([words[i],words[i+1]]) for i in range(len(words) -1)]
print(Counter(words + bigrams))
Counter({'air': 5, 'bomb': 3, 'strike': 3, 'air strike': 2, 'police': 2, 'air bomb': 2, 'the': 2, 'discovered': 1, 'bomb is': 1, 'the bomb': 1, 'have discovered': 1, 'full': 1, 'bomb the': 1, 'going on': 1, 'are going': 1, 'are': 1, 'discovered an': 1, 'the air': 1, 'hot air': 1, 'is full': 1, 'hot': 1, 'on strike': 1, 'is': 1, 'strike the': 1, 'police have': 1, 'bomb air': 1, 'of': 1, 'strike police': 1, 'of hot': 1, 'an': 1, 'strike air': 1, 'on': 1, 'full of': 1, 'police are': 1, 'have': 1, 'going': 1, 'an air': 1})
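If the key phrases are known in advance and can be of any length, another option is to let pandas count regex matches per row with Series.str.count and sum the result. This is a different technique from the two above, sketched under assumptions: the word-boundary pattern is illustrative, and the frame reuses the sample data from this answer.

```python
import re

import pandas as pd

df = pd.DataFrame({"text": ["Police have discovered an air bomb",
                            "Air strike the bomb",
                            "The air strike police are going on strike",
                            "Air bomb is full of hot air"]})
keywords = ["bomb", "police", "air strike"]

final_counts = {}
for kw in keywords:
    # \b anchors keep "bomb" from matching inside e.g. "bombing";
    # re.escape guards against regex metacharacters in a keyword.
    pattern = r"\b" + re.escape(kw) + r"\b"
    per_row = df["text"].str.count(pattern, flags=re.IGNORECASE)
    final_counts[kw] = int(per_row.sum())
print(final_counts)
```

Unlike the joined-corpus approaches, this never counts a phrase that spans two rows. Overlapping keywords (say "strike" and "air strike") would each be counted independently, which may or may not be what you want.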

Sum partial string (keyword) matches in a Pandas DataFrame
