Check RSS feed titles for keywords and print only the titles that contain one

Time: 2015-12-17 19:04:59

Tags: python regex rss

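I'm trying to build an RSS parser that checks every title for keywords, so that I only get the feed items I'm interested in. So far I can extract the titles with a regular expression, but I'm not sure how to continue. I want to check each title against multiple keywords, so ideally they would be loaded from a .txt file. Only titles with a positive match should be printed. Can anyone point me in the right direction?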

My code so far:

# -*- coding: utf-8 -*-
# (the encoding declaration only takes effect on the first or second line of the file)
import urllib2
import re
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://randomdomainXYZ.com/news-feed.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode

        try:
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            for title in titles:
                print title

        except Exception, e:
            print str(e)

    except Exception, e:
        print str(e)

main()

1 Answer:

Answer 0 (score: 1)

So you want to print the titles that contain at least one word from some list. Try:

for title in titles:
    if any(word in title for word in word_list):
        print title
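
Note that `word in title` is a case-sensitive substring test, so a keyword such as `art` would also match `Smart` but not `ART`. If that is not what you want, one option is a whole-word, case-insensitive check with `re`. The following is only a minimal sketch: `title_matches` is a hypothetical helper name, the sample titles are made up, and `\b` only behaves as expected for keywords made of letters, digits and underscores.

```python
import re

def title_matches(title, word_list):
    """Return True if any keyword occurs in title as a whole word, ignoring case."""
    for word in word_list:
        if not word:
            # Skip empty keywords; they would match almost any title.
            continue
        # re.escape keeps characters such as '.' in a keyword literal.
        pattern = r'\b' + re.escape(word) + r'\b'
        if re.search(pattern, title, re.IGNORECASE):
            return True
    return False

# Made-up sample data, for illustration only.
titles = ['Smart homes in 2016', 'Modern art auction', 'ART fair opens']
word_list = ['art']

for title in titles:
    if title_matches(title, word_list):
        print(title)
```

The call `print(title)` with a single argument behaves the same in Python 2 and Python 3, so the sketch should run under either.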

As for reading the word list, you can read all the lines from the file:

with open('word_list.txt') as f:
    word_list = f.readlines()

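# Strip the trailing newline ('\n') from each word and skip blank lines;
# an empty string is contained in every title and would match everything
word_list = [word.strip() for word in word_list if word.strip()]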
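
Putting the two pieces together with the regex from the question could look roughly like this. It is a sketch under assumptions: `find_matching_titles` is a hypothetical function name, and `sample_feed` is a made-up string standing in for the `sourceCode` that the question downloads, so the example runs without network access.

```python
import re

def find_matching_titles(source_code, word_list):
    """Return the <title> contents that contain at least one keyword."""
    titles = re.findall(r'<title>(.*?)</title>', source_code)
    # Drop empty strings: '' is a substring of every title.
    words = [word for word in word_list if word]
    return [title for title in titles
            if any(word in title for word in words)]

# Hypothetical sample standing in for the downloaded feed.
sample_feed = (
    '<rss><channel>'
    '<title>Example feed</title>'
    '<item><title>Python 3.5 released</title></item>'
    '<item><title>Weather update</title></item>'
    '</channel></rss>'
)

for title in find_matching_titles(sample_feed, ['Python', 'regex']):
    print(title)
```

In the real script you would pass the downloaded `sourceCode` and the `word_list` read from the file instead of the sample values. Returning a list rather than printing inside the function keeps the filtering easy to test.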