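I am trying to build an RSS parser that checks every title for keywords, so that I only get the feed items I am interested in. So far I can extract the titles with a regular expression, but I am not sure how to continue. I want to check each title against multiple keywords, so ideally they would be loaded from a .txt file. Only titles with a positive match should be printed. Can someone point me in the right direction?
My code so far: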
# -*- coding: utf-8 -*-
import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time

# Opener that keeps cookies and sends a browser-like User-agent header
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://randomdomainXYZ.com/news-feed.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode
        try:
            # Capture the text between every <title>...</title> pair
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            for title in titles:
                print title
        except Exception, e:
            print str(e)
    except Exception, e:
        print str(e)

main()
Answer 0 (score: 1)
So you want to print the titles that contain at least one word from a given list. Try:
for title in titles:
    if any(word in title for word in word_list):
        print title
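Note that the check above is a case-sensitive substring test, so the keyword "python" would not match a title containing "Python". If that matters, here is a minimal sketch of a case-insensitive variant. The helper name `filter_titles` and the sample data are made up for illustration and are not part of the original code; `print()` with a single argument is used so the snippet runs on both Python 2 and Python 3.

```python
def filter_titles(titles, word_list):
    """Return the titles containing at least one keyword (case-insensitive substring match)."""
    # Lower-case the keywords once instead of once per title
    lowered = [word.lower() for word in word_list]
    return [title for title in titles
            if any(word in title.lower() for word in lowered)]


# Made-up sample data, for illustration only
sample_titles = ['Python 3 released', 'Weather report', 'New RSS tools']
sample_words = ['python', 'rss']

matches = filter_titles(sample_titles, sample_words)
for title in matches:
    print(title)
```

Because this is still a substring match, a keyword such as "art" would also match "party". Whether that is acceptable depends on your keywords.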
As for reading the word list, you can read all the lines from a file:
with open('word_list.txt') as f:
    word_list = f.readlines()
# Make sure words don't end with a newline character ('\n')
word_list = [word.strip() for word in word_list]
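One thing to watch out for: a blank line in the file becomes an empty string after `strip()`, and an empty string is a substring of every title, so every title would match. Below is a small sketch that also drops blank lines. The function name `load_keywords` is hypothetical, and the temporary file only exists so the example runs on its own; in the real script you would pass 'word_list.txt'.

```python
import os
import tempfile


def load_keywords(path):
    """Read one keyword per line, stripping whitespace and skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Throwaway demo file (assumption: one keyword per line, as in word_list.txt)
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('python\n\n  rss  \n')

keywords = load_keywords(demo_path)
os.remove(demo_path)
print(keywords)
```

The resulting list can then be used as `word_list` in the title check.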