Python: How to find the string that contains the most matches in a list of strings

Time: 2012-03-14 18:57:47

Tags: python string list rss string-matching

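I'll try to explain what I need in detail:

I'm parsing an RSS feed in Python using feedparser. The feed has a list of items, each with a title, link and description, like any normal RSS feed.

Separately, I have a list of strings containing keywords that I need to find in the item descriptions.

What I need to do is find the item that matches the largest number of keywords.

Example: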

RSS Feed

<channel>
    <item>
        <title>Lion</title>
        <link>...</link>
        <description>
            The lion (Panthera leo) is one of the four big cats in the genus 
            Panthera, and a member of the family Felidae.
        </description>
    </item>
    <item>
        <title>Panthera</title>
        <link>...</link>
        <description>
            Panthera is a genus of the Felidae (cats), which contains 
            four well-known living species: the tiger, the lion, the jaguar, and the leopard.
        </description>
    </item>
    <item>
        <title>Cat</title>
        <link>...</link>
        <description>
            The domestic cat is a small, usually furry, domesticated, 
            carnivorous mammal. It is often called the housecat, or simply the 
            cat when there is no need to distinguish it from other felids and felines.
        </description>
    </item>
</channel>

Keyword list

['cat', 'lion', 'panthera', 'family']

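So in this case, the item with the most (unique) matches is the first one, because it contains all 4 keywords. (It doesn't matter that it says 'cats' rather than just 'cat'; I only need to find the literal keyword inside the string.)

Let me clarify: even if some description contains the keyword 'cat' 100 times (and none of the other keywords), it should not be the winner, because I'm looking for the description that contains the most distinct keywords, not the one in which a single keyword appears most often.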
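To make that distinction concrete, here is a minimal sketch. The sample strings and the helper names `total_occurrences` and `distinct_keywords` are made up for illustration only; they are not part of feedparser or of any existing code.

```python
keywords = ['cat', 'lion', 'panthera', 'family']

def total_occurrences(text):
    # Adds up every hit, so one heavily repeated keyword can dominate.
    lowered = text.lower()
    return sum(lowered.count(word) for word in keywords)

def distinct_keywords(text):
    # Each keyword contributes at most 1, however often it appears.
    lowered = text.lower()
    return sum(1 for word in keywords if word in lowered)

repetitive = "cat " * 100          # one keyword, repeated 100 times
varied = "A lion is a big cat."    # two different keywords, once each

print(total_occurrences(repetitive), distinct_keywords(repetitive))  # 100 1
print(total_occurrences(varied), distinct_keywords(varied))          # 2 2
```

Ranking by the second number is the behaviour I'm after: `varied` should beat `repetitive`.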

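Right now I'm looping over the RSS items and doing this "manually", counting the number of times each keyword appears (which is how I run into the problem described in the previous paragraph).

I'm very new to Python and come from a different language (C#), so I apologize if this is trivial.

How would you approach this problem?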

2 Answers:

Answer 0: (score: 3)

texts = [ "The lion (Panthera leo) ...", "Panthera ...", "..." ]
keywords  = ['cat', 'lion', 'panthera', 'family']

# gives the count of `word in text`
def matches(text):
    return sum(word in text.lower() for word in keywords)

# or inline that helper function as a lambda:
# matches = lambda text:sum(word in text.lower() for word in keywords)

# print the one with the highest count of matches
print(max(texts, key=matches))
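As a rough sketch of how this could be applied to the items from the question: the dictionaries below are hand-written stand-ins for parsed feed entries (with feedparser you would presumably read each entry's description instead), and `matched_keywords` is a hypothetical helper name. Because `word in text` yields at most one hit per keyword, repeated occurrences of the same keyword do not inflate the score.

```python
keywords = ['cat', 'lion', 'panthera', 'family']

# Stand-ins for parsed feed entries; the text is taken from the question.
items = [
    {"title": "Lion",
     "description": "The lion (Panthera leo) is one of the four big cats in "
                    "the genus Panthera, and a member of the family Felidae."},
    {"title": "Panthera",
     "description": "Panthera is a genus of the Felidae (cats), which contains "
                    "four well-known living species: the tiger, the lion, "
                    "the jaguar, and the leopard."},
    {"title": "Cat",
     "description": "The domestic cat is a small, usually furry, domesticated, "
                    "carnivorous mammal. It is often called the housecat, or "
                    "simply the cat when there is no need to distinguish it "
                    "from other felids and felines."},
]

def matched_keywords(item):
    # Keywords that occur at least once in the description (substring match).
    text = item["description"].lower()
    return [word for word in keywords if word in text]

# Pick the item with the largest number of distinct keywords.
best = max(items, key=lambda item: len(matched_keywords(item)))
print(best["title"], matched_keywords(best))
# Lion ['cat', 'lion', 'panthera', 'family']
```

Note that `max` returns the first item with the highest score, so ties are resolved by feed order.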

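Answer 1: (score: 0)

The other answer is very elegant, but may be too simple for the real world. Some of the ways it might break include:

  • Partial word matches - should 'cat' match 'concatenate'? What about 'cats'?
  • Case sensitivity - should 'cat' match 'CAT'? What about 'Cat'?

My solution allows for both of these cases.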

import re

test_text = """
Cat?

The domestic cat is a small, usually furry, domesticated, 
carnivorous mammal. It is often called the housecat, or simply the 
cat when there is no need to distinguish it from other felids and felines.
"""

wordlist = ['cat','lion','feline']
# Construct regexp like r'\W(cat|lion|feline)(s?)\W'
# Matches cat, lion or feline as a whole word ('cat' matches, 'concatenate'
# does not match)
# also allow for an optional trailing 's', so that both 'cat' and 'cats' will
# match.
wordlist_re = r'\W(' + '|'.join(wordlist) + r')(s?)\W'

# Get list of all matches from text. re.I means "case insensitive".
matches = re.findall(wordlist_re, test_text, re.I)

# Build list of matched words. the `[0]` means first capture group of the regexp
matched_words = [ match[0].lower() for match in matches]

# See which words occurred
unique_matched_words = [word for word in wordlist if word in matched_words]

# Count unique words
num_unique_matched_words = len(unique_matched_words)

The output looks like this:

>>> wordlist_re
'\\W(cat|lion|feline)(s?)\\W'
>>> matches
[('Cat', ''), ('cat', ''), ('cat', ''), ('feline', 's')]
>>> matched_words
['cat', 'cat', 'cat', 'feline']
>>> unique_matched_words
['cat', 'feline']
>>> num_unique_matched_words
2
>>>
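
One caveat with the `\W` delimiters above: each one consumes a character, so a keyword at the very start or end of the text, or two keywords separated by a single space, can be missed. A possible variant, offered as a sketch rather than a drop-in replacement, uses the zero-width word boundary `\b` and escapes the keywords with `re.escape`. The helper name `count_distinct_keywords` is hypothetical.

```python
import re

def count_distinct_keywords(text, keywords):
    """Count how many distinct keywords occur in text as whole words.

    An optional trailing 's' is allowed, and matching ignores case.
    """
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(word) for word in keywords) + r')s?\b',
        re.IGNORECASE,
    )
    # A set keeps each keyword once, no matter how often it appears.
    found = {match.group(1).lower() for match in pattern.finditer(text)}
    return len(found)

wordlist = ['cat', 'lion', 'feline']
print(count_distinct_keywords("cat lion", wordlist))              # 2
print(count_distinct_keywords("Cats and LIONS", wordlist))        # 2
print(count_distinct_keywords("concatenate housecat", wordlist))  # 0
```

To pick the best of several descriptions, this function could be passed to `max` through a `key` argument, as in the other answer.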