Question

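I've started playing with BeautifulSoup, but it isn't working. I'm just trying to get all the links with find_all('a'), and the result is always [] or null. The problem could be caused by ISO/UTF-8 encoding or by malformed HTML, right?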

I found that if I pass in only the markup between <body> and </body>, it works fine, so we can rule out encoding.

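So what should I do? Is there a built-in soup function to fix malformed HTML? Maybe use a regular expression to grab the <body> contents? Any clues? It's probably a common problem...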

By the way, I'm working with Portuguese (pt_BR) content on Win64 with Python 2.7, and an example page that fails is http://www.tudogostoso.com.br/

Edit: what I've done so far

import mechanize
from bs4 import BeautifulSoup

# I'm using mechanize to fetch the page
br = mechanize.Browser()
site = 'http://www.tudogostoso.com.br/'
r = br.open(site)

# The returned HTML is OK; I printed it and tested it a lot
html = r.read()

soup = BeautifulSoup(html)

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

# Nothing is printed.
# But if html is just <body>...</body> (cropped manually), it works and prints all the links.
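The manual cropping mentioned above could be automated with a regular expression, roughly as sketched here. This is only a workaround under assumptions: crop_to_body is a hypothetical helper name, the input is assumed to be a text string (not raw bytes), and a regex like this can misfire on pages where "</body>" appears inside a script or comment.

```python
import re

# Matches the first <body ...> ... </body> pair, case-insensitively,
# letting "." span newlines so multi-line pages are covered.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def crop_to_body(html):
    """Return the page reduced to a bare <body>...</body> element.

    If no body element is found, the input is returned unchanged.
    """
    match = BODY_RE.search(html)
    if match is None:
        return html
    return "<body>" + match.group(1) + "</body>"
```

The cropped string can then be passed to BeautifulSoup in place of the full page.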

Answer 1

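Thanks to @abarnert.

html5lib can handle malformed markup. Besides, HTML5 has some new features that may look malformed to someone like me, or even to older parsers such as the one BeautifulSoup uses by default. They aren't really malformed.

So, in the end, using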

soup = BeautifulSoup(html, "html5lib")

instead of just

soup = BeautifulSoup(html)

did the trick!
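To see which parser copes with a given page, one option is to try several in turn and keep the first that actually yields links. The sketch below assumes bs4 is installed; html5lib and lxml are optional and simply skipped when missing. extract_links is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def extract_links(html, parsers=("html5lib", "lxml", "html.parser")):
    """Try each parser in order; return (parser_name, hrefs) for the
    first one that finds at least one <a href=...>, else (None, [])."""
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; move on to the next one.
            continue
        links = [a["href"] for a in soup.find_all("a", href=True)]
        if links:
            return name, links
    return None, []
```

Calling extract_links(html) on the downloaded page returns the name of the parser that worked together with the list of href values, which makes it easy to compare parsers on the same input.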

Answer 2

To download the page, use a module such as requests or urllib2.

The requests module:

import requests
from bs4 import BeautifulSoup

# r.content holds the raw response bytes
r = requests.get('http://www.tudogostoso.com.br/')
html = r.content
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

urllib2:

import urllib2
from bs4 import BeautifulSoup

# r.read() returns the raw response bytes
r = urllib2.urlopen('http://www.tudogostoso.com.br/')
html = r.read()
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']
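Both snippets above hand raw bytes to BeautifulSoup, which then has to guess the encoding. If that guess were ever the problem, the from_encoding argument lets you state the encoding explicitly. A small sketch, assuming an ISO-8859-1 page; the sample markup is made up and parse_bytes is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def parse_bytes(raw, encoding=None, parser="html.parser"):
    """Parse raw bytes, optionally telling BeautifulSoup which
    encoding to use instead of letting it guess."""
    return BeautifulSoup(raw, parser, from_encoding=encoding)


# Made-up Portuguese sample ("Pão de açúcar"), encoded as ISO-8859-1 bytes.
sample = u'<html><body><a href="/receita">P\xe3o de a\xe7\xfacar</a></body></html>'.encode("iso-8859-1")
soup = parse_bytes(sample, encoding="iso-8859-1")
link_text = soup.a.get_text()
```

After parsing, soup.original_encoding reports the encoding BeautifulSoup ended up using, which is a quick way to check whether encoding is really the culprit.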
