Question

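I'll try to explain my need in detail:

I'm parsing an RSS feed in Python using feedparser. The feed has a list of items, each with a title, link and description, just like any normal RSS feed.

Separately, I have a list of strings containing keywords that I need to find in the item descriptions.

What I need to do is find the item with the most keyword matches.

Example: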

RSS Feed

<channel>
    <item>
        <title>Lion</title>
        <link>...</link>
        <description>
            The lion (Panthera leo) is one of the four big cats in the genus 
            Panthera, and a member of the family Felidae.
        </description>
    </item>
    <item>
        <title>Panthera</title>
        <link>...</link>
        <description>
            Panthera is a genus of the Felidae (cats), which contains 
            four well-known living species: the tiger, the lion, the jaguar, and the leopard.
        </description>
    </item>
    <item>
        <title>Cat</title>
        <link>...</link>
        <description>
            The domestic cat is a small, usually furry, domesticated, 
            carnivorous mammal. It is often called the housecat, or simply the 
            cat when there is no need to distinguish it from other felids and felines.
        </description>
    </item>
</channel>

Keyword list

['cat', 'lion', 'panthera', 'family']

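So in this case, the item with the most (unique) matches is the first one, because it contains all 4 keywords (it doesn't matter that it says 'cats' rather than just 'cat'; I only need to find the literal keyword inside the string).

To be clear, even if some description contains the 'cat' keyword 100 times (and none of the other keywords), it should not be the winner, because I'm looking for the most distinct keywords contained, not the most occurrences of a single keyword.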
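To illustrate the distinction between total occurrences and distinct keywords, here is a minimal sketch. The two sample descriptions and the helper names `total_occurrences` and `distinct_keywords` are made up for this illustration and are not part of the original feed or code.

```python
# Hypothetical sample descriptions (not taken from a real feed).
descriptions = [
    "cat cat cat cat cat",               # one keyword, repeated five times
    "The lion is one of the big cats.",  # two distinct keywords
]
keywords = ['cat', 'lion', 'panthera', 'family']


def total_occurrences(text):
    """Count every occurrence of every keyword, repetitions included."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in keywords)


def distinct_keywords(text):
    """Return the set of keywords that appear at least once (substring match)."""
    lowered = text.lower()
    return {word for word in keywords if word in lowered}


for text in descriptions:
    print(total_occurrences(text), len(distinct_keywords(text)), text)
```

Ranking by `total_occurrences` would pick the first description, while ranking by the size of `distinct_keywords` picks the second, which is the behavior asked for here.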

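Right now I'm looping over the RSS items and doing this "manually", counting the number of times each keyword appears (which runs into the problem described in the previous paragraph).

I'm very new to Python and I come from a different language (C#), so I apologize if this is trivial.

How would you approach this problem?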

Answer 1

texts = [ "The lion (Panthera leo) ...", "Panthera ...", "..." ]
keywords  = ['cat', 'lion', 'panthera', 'family']

# gives the count of `word in text`
def matches(text):
    return sum(word in text.lower() for word in keywords)

# or inline that helper function as a lambda:
# matches = lambda text:sum(word in text.lower() for word in keywords)

# print the one with the highest count of matches
print(max(texts, key=matches))
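The same idea can be applied to whole items rather than bare strings, so that the winning item's title and link stay available. This is only a sketch: plain dicts with `title` and `description` keys stand in for parsed feed entries (an assumption about how the entries are accessed), and the third description is shortened from the question's example.

```python
keywords = ['cat', 'lion', 'panthera', 'family']

# Stand-ins for parsed feed entries (assumption: each has a title and a description).
items = [
    {"title": "Lion",
     "description": "The lion (Panthera leo) is one of the four big cats in the genus "
                    "Panthera, and a member of the family Felidae."},
    {"title": "Panthera",
     "description": "Panthera is a genus of the Felidae (cats), which contains four "
                    "well-known living species: the tiger, the lion, the jaguar, "
                    "and the leopard."},
    {"title": "Cat",
     "description": "The domestic cat is a small, usually furry, domesticated, "
                    "carnivorous mammal."},
]


def matches(item):
    """Number of distinct keywords found (as substrings) in the item's description."""
    text = item["description"].lower()
    return sum(word in text for word in keywords)


best = max(items, key=matches)
print(best["title"], matches(best))
```

Because `word in text` is a boolean per keyword, each keyword contributes at most 1 to the sum, so repeated occurrences do not inflate the score.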

Answer 2

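The other answers are very elegant, but may be too simple for the real world. Some of the ways they might break include:

Partial word matches - should 'cat' match 'concatenate'? What about 'cats'?
Case sensitivity - should 'cat' match 'CAT'? What about 'Cat'?

My solution allows for both of these cases.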

import re

test_text = """
Cat?

The domestic cat is a small, usually furry, domesticated, 
carnivorous mammal. It is often called the housecat, or simply the 
cat when there is no need to distinguish it from other felids and felines.
"""

wordlist = ['cat','lion','feline']
# Construct regexp like r'\W(cat|lion|feline)(s?)\W'
# Matches cat, lion or feline as a whole word ('cat' matches, 'concatenate'
# does not match)
# also allow for an optional trailing 's', so that both 'cat' and 'cats' will
# match.
wordlist_re = r'\W(' + '|'.join(wordlist) + r')(s?)\W'

# Get list of all matches from text. re.I means "case insensitive".
matches = re.findall(wordlist_re, test_text, re.I)

# Build list of matched words. the `[0]` means first capture group of the regexp
matched_words = [ match[0].lower() for match in matches]

# See which words occurred
unique_matched_words = [word for word in wordlist if word in matched_words]

# Count unique words
num_unique_matched_words = len(unique_matched_words)

The output looks like this:

>>> wordlist_re
'\\W(cat|lion|feline)(s?)\\W'
>>> matches
[('Cat', ''), ('cat', ''), ('cat', ''), ('feline', 's')]
>>> matched_words
['cat', 'cat', 'cat', 'feline']
>>> unique_matched_words
['cat', 'feline']
>>> num_unique_matched_words
2
>>>
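One caveat with the `\W` delimiters above: they each need an actual non-word character, so a keyword at the very start or end of a string is missed. A possible variant, sketched here as an alternative rather than as the answer's own method, uses the zero-width `\b` word boundary and picks the best text out of several. The sample `texts` and the helper name `unique_matches` are invented for this example.

```python
import re

wordlist = ['cat', 'lion', 'feline']

# \b is a zero-width word boundary, so it also matches at the very start or
# end of the text; re.escape guards against keywords containing regex
# metacharacters. The optional 's' lets 'cats' count as 'cat'.
wordlist_re = re.compile(
    r'\b(' + '|'.join(re.escape(word) for word in wordlist) + r')s?\b',
    re.IGNORECASE,
)


def unique_matches(text):
    """Return the set of keywords that occur as whole words (optionally plural)."""
    return {match.group(1).lower() for match in wordlist_re.finditer(text)}


# Hypothetical sample texts for illustration.
texts = [
    "Cat",                         # keyword is the entire string
    "Lions, felines and one cat",  # three distinct keywords
    "concatenate",                 # 'cat' appears only inside another word
]

best = max(texts, key=lambda text: len(unique_matches(text)))
print(best, sorted(unique_matches(best)))
```

Using a set for the matches means each keyword is counted once per text, which is the "unique matches" ranking the question asks for.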

Python: how to find the string that contains the most matches in a list of strings
