Question

Here is a simple piece of code in Python 2.7.2 that fetches a site and collects all the links from it:

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")

links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')

print len(links1)
print len(links2)

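The problem is that it does not work in the second case. It prints 102 and 0, although the second site obviously contains links. BeautifulSoup does not throw a parsing error, and it pretty-prints the markup fine. I suspect the cause may be the first line of the med.stanford.edu source, which declares the document to be XML (even though the content type is text/html):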

<?xml version="1.0" encoding="iso-8859-1"?>

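I can't figure out how to set up BeautifulSoup to ignore that line, or how to work around it. I'm using html5lib as the parser because I had problems with the default one (incorrect markup).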

Answer 1

When a document claims to be XML, I find that the lxml parser gives the best results. Running your code with the lxml parser instead of html5lib finds 300 links.

Answer 2

The problem is indeed the <?xml... line. Ignoring it is very simple: just skip the first line of the content by replacing

    content = response.read()

with something like

    # readlines() keeps each line's trailing newline, so join with ""
    content = "".join(response.readlines()[1:])

After this change, len(links2) becomes 300.

ETA: You probably want to do this conditionally, so that you don't always skip the first line of the content. For example:

content = response.read()
if content.startswith("<?xml"):
    content = "\n".join(content.split("\n")[1:])
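The conditional check above could also be written with a regular expression, so that it does not depend on the declaration sitting on a line of its own. This is only a minimal sketch: `strip_xml_declaration` is a hypothetical helper name (not part of BeautifulSoup or urllib2), and it assumes `content` is a text string, as returned by `response.read()` in Python 2.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: matches an optional run of
# whitespace followed by a leading <?xml ... ?> declaration.
_XML_DECL = re.compile(r'\A\s*<\?xml[^>]*\?>\s*')

def strip_xml_declaration(content):
    """Return content without a leading <?xml ...?> declaration, if present."""
    return _XML_DECL.sub('', content, count=1)

# Illustrative input only; the real page would come from response.read().
page = ('<?xml version="1.0" encoding="iso-8859-1"?>\n'
        '<html><body><a href="/x">x</a></body></html>')
cleaned = strip_xml_declaration(page)
```

The cleaned string can then be passed to BeautifulSoup as before. Content that does not start with an XML declaration is returned unchanged.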

BeautifulSoup incorrectly parses the page and does not find links
