Question

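My script works without major flaws under Python 2.7.3 and 2.7.5+, but it fails under 2.7.6. I suspect it may be related to how BeautifulSoup handles unicode, but I'm not sure.

It basically goes like this: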

from bs4 import BeautifulSoup

# harvest HTML, store it in the variable html
html = harvest()
# the HTML is a string of ascii characters (no extended anything)
soup = BeautifulSoup(html)
trs = soup.find_all('tr', event_attr_id=True)

for tr in trs:
    # do stuff
    # this never executes with Python 2.7.6 (the loop body is never entered)
    pass

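When I pprint(soup.prettify()), I see that the output contains several \xa0 characters. I don't care what bs does to the soup; the problem is that it doesn't work with Python 2.7.6.

Can you suggest some possible fixes? Is the encoding relevant?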

Answer 1

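The problem (which was not easy to infer from the question) is related to the parser BeautifulSoup uses: the parser doesn't like the HTML it receives, because it isn't a full page but the content of a div from an AJAX response. I think it was using html5lib.

I came up with two working solutions:

Wrap the fragment so that the default parser accepts it: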

soup = BeautifulSoup(''.join(['<html><body><table>', html, '</table></body></html>']))

Use a different parser:

soup = BeautifulSoup(html,"html.parser")
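A rough sketch of why the built-in parser is more forgiving here. BeautifulSoup's "html.parser" backend is built on the standard library's HTMLParser, which simply reports tags in the order it sees them and does not require rows to sit inside a table. A tree-building HTML5 parser, by contrast, is allowed by the HTML spec to ignore a stray tr start tag outside a table, which would be consistent with the empty result in the question, though that is an assumption. The fragment and the RowCollector class below are hypothetical, made up for illustration only.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class RowCollector(HTMLParser):
    """Collect the event_attr_id value of every <tr> start tag seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.row_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "tr" and "event_attr_id" in attrs:
            self.row_ids.append(attrs["event_attr_id"])


# Hypothetical AJAX fragment: bare rows with no enclosing <table>.
fragment = (
    '<tr event_attr_id="1"><td>a</td></tr>'
    '<tr event_attr_id="2"><td>b</td></tr>'
)

collector = RowCollector()
collector.feed(fragment)
collector.close()
print(collector.row_ids)
```

Both rows are reported even though the fragment is not a complete document, which is the behaviour the second solution relies on.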

The fact that html5lib isn't lenient enough to accept what I'm feeding it may be specific to my installation (i.e. an old version of html5lib).
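Since the behaviour may depend on what is installed, it can help to check which parsers are importable on each machine. The sketch below assumes the preference order given in the BeautifulSoup documentation when no parser is named (lxml, then html5lib, then html.parser); the function name installed_parsers is hypothetical, and the version lookup falls back to "unknown" for any package that does not expose __version__.

```python
import importlib
import sys

# Parser name -> module to import; None means it ships with Python.
# Order follows BeautifulSoup's documented preference when no parser is named.
PARSER_MODULES = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", None),
]


def installed_parsers():
    """Return (parser_name, version) pairs for parsers importable here."""
    found = []
    for parser_name, module_name in PARSER_MODULES:
        if module_name is None:
            # The built-in parser is always available.
            version = "stdlib %d.%d.%d" % sys.version_info[:3]
            found.append((parser_name, version))
            continue
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not installed on this machine
        found.append((parser_name, getattr(module, "__version__", "unknown")))
    return found


for name, version in installed_parsers():
    print("%s: %s" % (name, version))
```

Running this on the machine where the script works and on the one where it fails should show whether the two installations would pick different parsers. The first line printed is the one BeautifulSoup(html) would be expected to choose.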

Beautifulsoup4 and Python 2.7.6
