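I have a list of strings that come from different email conversations. I would like to see whether there are words or word combinations that are used frequently.
An example list would be: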
subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]
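The function would have to detect "Company Name", since that combination is used more than once, and "Proposal", since that word is used more than once. The words are not known in advance, so I guess it would have to start by trying all possible combinations.
The actual list is of course much longer than this example, so trying all combinations by hand does not seem like the best approach. What would be the best way to do this?
Update
I have used Tim Pietzcker's answer to start developing a function for this, but I am still stuck on applying the Counter correctly. It keeps returning the length of the list as the count for every phrase.
The phrases function, including a punctuation filter, a check whether the phrase has already been seen, and a maximum length of 3 words per phrase: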
def phrases(string, phrase_list):
    words = string.split()
    result = []
    punctuation = '\'\"-_,.:;!? '
    for number in range(len(words)):
        for start in range(len(words)-number):
            if number+1 <= 3:
                phrase = " ".join(words[start:start+number+1])
                if phrase in phrase_list:
                    pass
                else:
                    phrase_list.append(phrase)
                    phrase = phrase.strip(punctuation).lower()
                    if phrase:
                        result.append(phrase)
    return result, phrase_list
Then the loop over the list of subjects:
phrase_list = []
ranking = {}
for s in subjects:
    result, phrase_list = phrases(s, phrase_list)
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)
"all_phrases" returns a list of tuples in which every count is 167, which is the length of the subject list I am using. I am not sure what I am missing here...
Answer 0 (score: 1)
You also want to find phrases that are made up of more than one word. No problem. This should even scale quite well.
import collections

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names'
]

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result
The function phrases() splits the input string on whitespace and returns every run of consecutive words, of any length:
In [2]: phrases("A Day in the Life")
Out[2]:
['A',
'Day',
'in',
'the',
'Life',
'A Day',
'Day in',
'in the',
'the Life',
'A Day in',
'Day in the',
'in the Life',
'A Day in the',
'Day in the Life',
'A Day in the Life']
Now you can count how many times each of these phrases is found across all your subjects:
all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))
Result:
In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out[3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3),
('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]
Note that you may want to use other criteria than simply splitting on whitespace, for example ignoring punctuation and case.
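As a rough sketch of that last note (one possible variant, not a definitive implementation): the version below lower-cases each word, strips punctuation from both ends, drops tokens that end up empty, and caps phrases at three words as in the question's update. The names `normalized_phrases`, `count_phrases`, `PUNCTUATION` and `max_words` are made up for illustration, and the set of punctuation characters is an assumption. It also calls the phrase function once per subject inside the Counter's generator. In the update's version, `result` does not change while `for s in subjects` runs, so every phrase in it is counted len(subjects) times.

```python
import collections

# Assumption: characters to strip from both ends of each word.
PUNCTUATION = '\'"-_,.:;!?/ '

def normalized_phrases(string, max_words=3):
    """Return all runs of 1..max_words consecutive cleaned-up words."""
    words = [w.strip(PUNCTUATION).lower() for w in string.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    result = []
    for length in range(1, min(max_words, len(words)) + 1):
        for start in range(len(words) - length + 1):
            result.append(" ".join(words[start:start + length]))
    return result

def count_phrases(subjects, max_words=3):
    """Count every phrase over all subjects, one phrase list per subject."""
    return collections.Counter(
        phrase
        for subject in subjects
        for phrase in normalized_phrases(subject, max_words)
    )

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]
counts = count_phrases(subjects)
# Keep only the phrases that occur more than once.
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
print(repeated)
```

Because separator tokens such as "-" and "/" are dropped before the phrases are built, words on either side of them become neighbours ("firm company"). If that is not wanted, one option would be to split each subject on those separators first and collect phrases per piece.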
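Answer 1 (score: 0)
I would suggest using whitespace as the separator; otherwise there are too many possibilities unless you specify what an allowed "phrase" should look like.
To count word occurrences you can use Counter from the collections module: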
import operator
from collections import Counter

d = Counter(' '.join(subjects).split())
# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])
Output:
3 Name
3 Company
2 Proposal
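As a side note, Counter can do the ordering itself: `most_common()` returns the (word, count) pairs from most to least frequent, so the `operator.itemgetter` sort is not strictly needed. A minimal sketch, with the question's `subjects` list repeated so that it runs on its own (the name `repeated_words` is only for illustration):

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

d = Counter(' '.join(subjects).split())
# most_common() yields (word, count) pairs, highest count first
repeated_words = [(word, count) for word, count in d.most_common() if count > 1]
for word, count in repeated_words:
    print(count, word)
```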
Answer 2 (score: 0)
Similar to pp_'s answer, using split.
import operator

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]
flat_list = [item for i in subjects for item in i.split()]
count_dict = {i: flat_list.count(i) for i in flat_list}
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))
Output:
[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]
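One thing to keep in mind with this version is that `flat_list.count(i)` rescans the whole list for every word, which may get slow on a long list of subjects. A possible single-pass variant that still uses a plain dict is sketched below; `count_words` is a hypothetical helper name, not something from the answer above.

```python
import operator

def count_words(subjects):
    """Tally whitespace-separated words in one pass over the subjects."""
    count_dict = {}
    for subject in subjects:
        for word in subject.split():
            # dict.get supplies 0 the first time a word is seen
            count_dict[word] = count_dict.get(word, 0) + 1
    return count_dict

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]
count_dict = count_words(subjects)
sorted_counts = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))
print(sorted_counts[:3])
```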