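I'll try to explain what I need in detail:
I'm using feedparser in Python to parse an RSS feed. Naturally, the feed has a list of items, each with a title, link, and description, like any ordinary RSS feed.
Separately, I have a list of strings containing keywords that I need to find in the item descriptions.
What I need to do is find the item whose description matches the most keywords.
Example: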
RSS Feed
<channel>
<item>
<title>Lion</title>
<link>...</link>
<description>
The lion (Panthera leo) is one of the four big cats in the genus
Panthera, and a member of the family Felidae.
</description>
</item>
<item>
<title>Panthera</title>
<link>...</link>
<description>
Panthera is a genus of the Felidae (cats), which contains
four well-known living species: the tiger, the lion, the jaguar, and the leopard.
</description>
</item>
<item>
<title>Cat</title>
<link>...</link>
<description>
The domestic cat is a small, usually furry, domesticated,
carnivorous mammal. It is often called the housecat, or simply the
cat when there is no need to distinguish it from other felids and felines.
</description>
</item>
</channel>
Keyword list
['cat', 'lion', 'panthera', 'family']
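So in this case, the item with the most (unique) matches is the first one, because it contains all 4 keywords (it doesn't matter that it says 'cats' rather than just 'cat'; I only need to find the literal keyword inside the string).
Let me clarify: even if some description contained the 'cat' keyword 100 times (and none of the other keywords), it would not be the winner, because I'm looking for the most distinct keywords contained, not the most occurrences of a single keyword.
Right now I'm looping over the RSS items and doing this "manually", counting the number of times each keyword appears (but that runs into the problem described in the previous paragraph).
I'm new to Python and come from a different language (C#), so I apologize if this is trivial.
How would you approach this problem?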
Answer 0 (score: 3)
texts = ["The lion (Panthera leo) ...", "Panthera ...", "..."]
keywords = ['cat', 'lion', 'panthera', 'family']

# Count how many distinct keywords occur in `text`.
# `word in text` is True/False, so each keyword counts at most once.
def matches(text):
    lowered = text.lower()
    return sum(word in lowered for word in keywords)

# or inline that helper function as a lambda:
# matches = lambda text: sum(word in text.lower() for word in keywords)

# print the text with the highest count of matched keywords
print(max(texts, key=matches))
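To apply the same idea to the feed items rather than bare strings, something like the following might work. It is only a sketch: the `items` list uses plain dicts as stand-ins for parsed feed entries (assuming each entry can be indexed by a `description` key), and `matched_keywords` and `best_item` are hypothetical helper names, not part of feedparser.

```python
# Stand-in data shaped like the RSS items in the question (assumption:
# real feed entries can be indexed by "title" and "description" the same way).
items = [
    {"title": "Lion",
     "description": "The lion (Panthera leo) is one of the four big cats in "
                    "the genus Panthera, and a member of the family Felidae."},
    {"title": "Panthera",
     "description": "Panthera is a genus of the Felidae (cats), which contains "
                    "four well-known living species: the tiger, the lion, "
                    "the jaguar, and the leopard."},
    {"title": "Cat",
     "description": "The domestic cat is a small, usually furry, domesticated, "
                    "carnivorous mammal. It is often called the housecat, or "
                    "simply the cat when there is no need to distinguish it "
                    "from other felids and felines."},
]
keywords = ['cat', 'lion', 'panthera', 'family']


def matched_keywords(text, keywords):
    """Return the set of keywords found as substrings of text (case-insensitive)."""
    lowered = text.lower()
    return {word for word in keywords if word in lowered}


def best_item(items, keywords):
    """Return the item whose description contains the most distinct keywords."""
    return max(items,
               key=lambda item: len(matched_keywords(item["description"], keywords)))


best = best_item(items, keywords)
print(best["title"])  # prints: Lion
```

Because a set is used, a description that repeats 'cat' many times still contributes only one match for that keyword, which is the behavior the question asks for.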
Answer 1 (score: 0)
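The other answers are very elegant, but may be too simple for the real world. Some of the ways they might break include partial-word matches (should 'cat' match 'concatenate'?) and plural forms ('cat' versus 'cats').
My solution allows for both of these cases.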
import re
test_text = """
Cat?
The domestic cat is a small, usually furry, domesticated,
carnivorous mammal. It is often called the housecat, or simply the
cat when there is no need to distinguish it from other felids and felines.
"""
wordlist = ['cat','lion','feline']
# Construct regexp like r'\W(cat|lion|feline)(s?)\W'
# Matches cat, lion or feline as a whole word ('cat' matches, 'concatenate'
# does not match)
# also allow for an optional trailing 's', so that both 'cat' and 'cats' will
# match.
wordlist_re = r'\W(' + '|'.join(wordlist) + r')(s?)\W'
# Get list of all matches from text. re.I means "case insensitive".
matches = re.findall(wordlist_re, test_text, re.I)
# Build list of matched words. the `[0]` means first capture group of the regexp
matched_words = [ match[0].lower() for match in matches]
# See which words occurred
unique_matched_words = [word for word in wordlist if word in matched_words]
# Count unique words
num_unique_matched_words = len(unique_matched_words)
The output is as follows:
>>> wordlist_re
'\\W(cat|lion|feline)(s?)\\W'
>>> matches
[('Cat', ''), ('cat', ''), ('cat', ''), ('feline', 's')]
>>> matched_words
['cat', 'cat', 'cat', 'feline']
>>> unique_matched_words
['cat', 'feline']
>>> num_unique_matched_words
2
>>>
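One caveat with the `\W` delimiters above: each `\W` consumes a character, so a keyword at the very start or end of the text, or two keywords separated by a single space, can be missed. A possible variation, sketched below, uses the zero-width `\b` word boundary instead and escapes the keywords with `re.escape`. The function name `unique_whole_word_matches` is made up for illustration; this is a swapped-in alternative, not the code from the answer above.

```python
import re


def unique_whole_word_matches(text, wordlist):
    """Return the set of words from wordlist that occur in text as whole
    words (case-insensitive), allowing an optional trailing 's'."""
    # \b is a zero-width word boundary, so it consumes no characters and
    # also matches at the start and end of the string.
    pattern = r'\b(' + '|'.join(re.escape(w) for w in wordlist) + r')s?\b'
    return {m.group(1).lower() for m in re.finditer(pattern, text, re.I)}


wordlist = ['cat', 'lion', 'feline']
found = unique_whole_word_matches("Cats concatenate; lion felines", wordlist)
print(len(found))  # prints: 3
```

To rank several descriptions, the same function can serve as the key, for example `max(texts, key=lambda t: len(unique_whole_word_matches(t, wordlist)))`.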