Question

I want to find RSS links on a website. But my code also returns img src and CSS links, because their URLs contain the string "rss".

Here is my code:

import urllib2
import re

website = urllib2.urlopen("http://www.apple.com/rss")
html = website.read()
links = re.findall('"((http)s?://.*rss.*)"',html)
for link in links:
    print link

Answer 1

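## remove everything above the container div
## (?s) lets "." match newlines, so the pattern can span multiple lines
html = re.sub(r'(?s).*?<div id="container">', "", html, count=1)

## remove everything from the callout div to the bottom
html = re.sub(r'(?s)<div class="callout">.*', "", html)

## then match links that appear inside list items
links = re.findall(r'<li[^>]*>\s*<a href="(https?://[^"]*)"', html, re.IGNORECASE)
## you can add the text "rss" to the pattern if you want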

Answer 2

I don't recommend using regular expressions to parse HTML. There are better tools for finding links in a web page. My favorite is lxml.

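import lxml.html

# html is the page source fetched in the question
root = lxml.html.fromstring(html)
# iterlinks() yields (element, attribute, link, pos) tuples
links = root.iterlinks()
next(links)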

The above lets you iterate over each link. You then need to infer whether a link refers to an RSS feed. Here are some possible approaches:

Look for "rss" in the URL
Make a request and check the response type (application/rss+xml)
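
The URL heuristic can be sketched with the standard library alone. This is a minimal Python 3 sketch that swaps lxml for `html.parser` so it has no dependencies; `LinkCollector`, `find_rss_candidates`, and the sample markup are hypothetical names and data introduced for illustration. It only looks at the `href` of `<a>` tags, which is one way to avoid picking up `img src` and stylesheet links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Skip <img>, <link>, <script> and so on: only anchors are candidates.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def find_rss_candidates(html):
    """Return anchor hrefs whose URL contains 'rss' (a heuristic only)."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if "rss" in href.lower()]


# Made-up sample page: an image, a stylesheet, and two anchors.
sample = """
<img src="http://example.com/images/rss_icon.png">
<link rel="stylesheet" href="http://example.com/css/rss.css">
<a href="http://example.com/news/rss.xml">News feed</a>
<a href="http://example.com/about">About</a>
"""
candidates = find_rss_candidates(sample)
print(candidates)
```

A match here is only a candidate: the substring test will miss feeds whose URL does not mention "rss" and can still accept ordinary pages that do.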

Without actually checking the server's response, you cannot know whether something is RSS. A URL like http://www.example.com/f could be an RSS feed. Until you check, you can't be sure.
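
Checking the response type might look roughly like the following. This is a Python 3 sketch using `urllib.request`; the function names `is_rss_content_type` and `url_serves_rss` are hypothetical, and it assumes the server answers HEAD requests and labels its feed as application/rss+xml, which not every server does.

```python
from urllib.request import Request, urlopen


def is_rss_content_type(header_value):
    """True if a Content-Type header value names application/rss+xml."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/rss+xml"


def url_serves_rss(url, timeout=10):
    """Send a HEAD request and inspect the Content-Type of the reply.

    Assumption: the server supports HEAD; some only respond properly to GET.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return is_rss_content_type(response.headers.get("Content-Type"))
```

Calling `url_serves_rss("http://www.example.com/f")` would then settle the question for that particular URL, at the cost of one request per candidate link.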

Finding RSS links in a web page using regular expressions
