Question

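I have a list of strings that come from different email conversations. I would like to see whether there are words or word combinations that are used frequently.

An example list would be: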

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal'
           ]

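The function would have to detect "Company Name", since that combination is used more than once, and "Proposal", which is also used more than once. The words won't be known in advance, so I guess it would have to start by trying all possible combinations.

The actual list is of course much longer than this example, so trying all combinations by hand doesn't seem like the best approach. What would be the best way to go about this?

Update

I've started developing a function for this based on Tim Pietzcker's answer, but I still can't get Counter applied correctly. It keeps returning the length of the list as the count for every phrase.

The phrases function, including a punctuation filter, a check for whether the phrase has already been seen, and a maximum length of 3 words per phrase: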

def phrases(string, phrase_list):
  words = string.split()
  result = []
  punctuation = '\'\"-_,.:;!? '
  for number in range(len(words)):
      for start in range(len(words)-number):
        if number+1 <= 3:
          phrase = " ".join(words[start:start+number+1])
          if phrase in phrase_list:
            pass
          else:
            phrase_list.append(phrase)
            phrase = phrase.strip(punctuation).lower()
            if phrase:
               result.append(phrase)
  return result, phrase_list

And then the loop over the list of subjects:

phrase_list = []
ranking = {}
for s in subjects:
    result, phrase_list = phrases(s, phrase_list)
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list of tuples in which every count is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...

Answer 1

You also want to find phrases that consist of more than one word. No problem. This should even scale quite well.

import collections

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal',
              'Some more Firm / Company Names'
           ]

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result

The function phrases() splits the input string on whitespace and returns every run of consecutive words, of any length:

In [2]: phrases("A Day in the Life")
Out[2]:
['A',
 'Day',
 'in',
 'the',
 'Life',
 'A Day',
 'Day in',
 'in the',
 'the Life',
 'A Day in',
 'Day in the',
 'in the Life',
 'A Day in the',
 'Day in the Life',
 'A Day in the Life']

Now you can count how many times each of these phrases is found across all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out[3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3), 
 ('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use criteria other than simply splitting on whitespace, for example ignoring punctuation and case.
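As a rough illustration of that note, here is a minimal sketch that lowercases each word, strips surrounding punctuation, and caps phrases at three words (the limit mentioned in the question's update). The name `normalized_phrases` and the `max_words` parameter are made up for this example and are not part of the answer above.

```python
import collections
import string

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

def normalized_phrases(subject, max_words=3):
    # Hypothetical helper: lowercase each token and strip punctuation
    # from both ends; tokens that were pure punctuation ("-", "/")
    # become empty strings and are dropped.
    words = [w.strip(string.punctuation).lower() for w in subject.split()]
    words = [w for w in words if w]
    result = []
    # Collect every run of 1..max_words consecutive words.
    for length in range(1, min(max_words, len(words)) + 1):
        for start in range(len(words) - length + 1):
            result.append(" ".join(words[start:start + length]))
    return result

counts = collections.Counter(
    phrase for subject in subjects for phrase in normalized_phrases(subject)
)
# Keep only phrases that occur more than once.
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
print(repeated)
```

One side effect of this choice: dropping punctuation-only tokens makes the words on either side of them adjacent, so "cooperate - Company" is counted as the phrase "cooperate company". Whether that is desirable depends on the data.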

Answer 2

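I'd suggest using whitespace as the separator; otherwise there are too many possibilities unless you specify what an allowed phrase should look like.

To count word occurrences, you can use Counter from the collections module: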

import operator
from collections import Counter

d = Counter(' '.join(subjects).split())

# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name
3 Company
2 Proposal
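As a possible shortcut, `Counter.most_common()` already returns the (word, count) pairs sorted from most to least frequent, so the separate `sorted()` call with `operator.itemgetter` may not be needed. A minimal sketch, reusing the question's example list (the names `word_counts` and `repeated` are chosen for this example):

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

word_counts = Counter(' '.join(subjects).split())

# most_common() yields (word, count) pairs, highest count first.
repeated = [(word, count) for word, count in word_counts.most_common() if count > 1]
for word, count in repeated:
    print(count, word)
```

The relative order of words with equal counts can differ between Python versions, so it is safer not to rely on it.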

Answer 3

Similar to pp_'s answer, using split.

import operator

subjects = [
          'Proposal to cooperate - Company Name',
          'Company Name Introduction',
          'Into Other Firm / Company Name',
          'Request for Proposal'
       ]
flat_list = [item for i in subjects for item in i.split() ]
count_dict = {i:flat_list.count(i) for i in flat_list}
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))

Output:

[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]
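One thing to keep in mind with the dict comprehension above: `flat_list.count(i)` rescans the whole list for every word, which can get slow for long lists. A single-pass variant, sketched here with a plain dict (the variable names are chosen for this example), might look like:

```python
import operator

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

count_dict = {}
for subject in subjects:
    for word in subject.split():
        # Increment the running count, starting from 0 for unseen words.
        count_dict[word] = count_dict.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first.
sorted_counts = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)
```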
