Parsing emails to identify keywords

Date: 2018-08-15 16:43:54

Tags: python text

I'm looking for a way to parse a list of email texts to identify keywords. Let's say I have the following list:

sentences = [['this is a paragraph there should be lots more words here'],
 ['more information in this one'],
 ['just more words to be honest, not sure what to write']]

I want to use a regular expression to check whether the words from a keyword list appear in these sentences. I don't want 'informations' to be caught, only 'information':

keywords = ['information', 'boxes', 'porcupine']

I have been trying to do the following:

['words' in words for [word for word in [sentence for sentence in sentences]]

for sentence in sentences:
    sentence.split(' ')

Ultimately, I want to filter the current list down to only the elements that contain the keywords I specify.

keywords = ['information', 'boxes']

sentences = [['this is a paragraph there should be lots more words here'],
     ['more information in this one'],
     ['just more words to be honest, not sure what to write']]

output: [False, True, False]

Or ultimately:

parsed_list = [['more information in this one']]

4 answers:

Answer 0 (score: 1)

Here is a one-liner that solves your problem. I find the lambda syntax easier to read than a nested list comprehension.

keywords = ['information', 'boxes']

sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]


results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))

print(results_lambda)

[['more information in this one']]
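Note that 'word in sentence[0]' is a substring test, so it would also treat 'informations' as a hit for 'information', which the question explicitly wants to avoid. A minimal sketch of a whole-word variant using word-boundary regexes (the helper name has_whole_keyword is illustrative, not from the original answer):

```python
import re

keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more informations, to check partial matches are skipped']]

def has_whole_keyword(sentence):
    # \b word boundaries make 'information' match while 'informations' does not
    return any(re.search(r'\b' + re.escape(word) + r'\b', sentence[0])
               for word in keywords)

results = [s for s in sentences if has_whole_keyword(s)]
print(results)  # [['more information in this one']]
```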

Answer 1 (score: 0)

Are you looking for sentences that contain all of the words in your keyword list?

If so, you can use a set of those keywords and filter each sentence based on whether all of the words are present in it:

One way to do it:

keyword_set = set(keywords)
n = len(keyword_set) # number of keywords
def allKeywdsPresent(sentence):
    return len(set(sentence.split(" ")) & keyword_set) == n # the intersection of both sets should equal the keyword set

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]

# filtered is the final set of sentences which satisfy your condition

# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]

There may be more optimal approaches (e.g. the set created for each sentence in allKeywdsPresent could be replaced by a single pass over all the elements, etc.), but this is a start.

Also, understand that using a set means duplicates in your keyword list will be eliminated. So if you have a keyword list with duplicates, use a dict instead of a set to keep a count of each keyword and adapt the logic accordingly.
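As a sketch of that dict-based variant, using collections.Counter and assuming each keyword must occur in the sentence at least as many times as it is listed (the function name all_keywords_with_counts is illustrative):

```python
from collections import Counter

keywords = ['information', 'more', 'more']  # 'more' is required twice
sentences = [['more information in this one'],
             ['more information and then more information']]

def all_keywords_with_counts(sentence):
    word_counts = Counter(sentence.split(' '))
    needed = Counter(keywords)  # duplicates become counts instead of vanishing
    # every keyword must appear at least as often as required
    return all(word_counts[w] >= n for w, n in needed.items())

filtered = [s for s in sentences if all_keywords_with_counts(s[0])]
print(filtered)  # [['more information and then more information']]
```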

Judging from your example, though, it is enough for at least one keyword to match. In that case you need to modify allKeywdsPresent() [perhaps renaming it anyKeywdsPresent]:

def anyKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
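Applied to the question's data, the any-variant produces both of the outputs the question asks for; a quick, self-contained sketch:

```python
keywords = ['information', 'boxes']
keyword_set = set(keywords)
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]

def anyKeywdsPresent(sentence):
    # True if at least one keyword appears as a whole word
    return any(word in keyword_set for word in sentence.split())

boolean_array = [anyKeywdsPresent(s[0]) for s in sentences]
print(boolean_array)  # [False, True, False]

parsed_list = [s for s in sentences if anyKeywdsPresent(s[0])]
print(parsed_list)    # [['more information in this one']]
```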

Answer 2 (score: 0)

This can be done with a quick list comprehension!

lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']
result = [x for x in lists if any(s in x[0] for s in keywords)]
print(result)

Result: [['here is one sentence'], ['let us filter!'], ['more than one word filter']] Hope this helps!

Answer 3 (score: 0)

If you only want to match whole words, not just substrings, you have to account for all the word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The simplest, though not foolproof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.

Once you have the list of words in a text and the list of keywords to match, the simplest, and probably the most performant, way to see if there is a match is to do a set intersection between the two. So:

import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]

keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]

So, how does it work? Simple: we iterate over each sentence (lowercasing it for case insensitivity) and then split it into words with the above regex. This means that, for example, the first sentence will split into:

['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']

We then convert it to a set for fast comparison (a set is a hashed sequence, and hash-based intersections are extremely fast), which, as a bonus, also eliminates duplicate words.

Finally, we perform a set intersection with keywords: if anything is returned, the two sets have at least one word in common, which means the if ... comparison evaluates to True, and in that case the current sentence gets added to the result.
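The truthiness of the intersection can be seen in isolation; a small sketch using the first example sentence:

```python
words = {'more', 'information', 'in', 'this', 'one'}
keywords = {'information', 'boxes', 'porcupine'}

overlap = words & keywords                # set intersection
print(overlap)                            # {'information'}
print(bool(overlap))                      # True -> sentence is kept
print(bool({'no', 'match'} & keywords))   # False -> sentence is dropped
```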

A final note: keep in mind that while \W+ is enough to split sentences into words (and certainly better than splitting on whitespace alone), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.