Question

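I want to fetch the HTML from a website and pass it to BeautifulSoup for parsing. The problem is that the HTML returned by urllib2.urlopen() contains newlines (\n) and tabs (\t), as well as single quotes and other escaped characters. When I try to build a BeautifulSoup object from this HTML, I get an error.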

b = BeautifulSoup(src)

gives this error.

My code:

def get_page_source(url):
    """
    Retrieves the HTML source code for url.
    """
    try:
        return urllib2.urlopen(url)
    except:
        return ""


def retrieve_links(url):
    """
    Use the BeautifulSoup module to efficiently grab all links from the source
    code retrieved by get_page_source.
    """
    src = get_page_source(url)   
    b = BeautifulSoup(src)

    .
    .
    .

How can I fix this?

Edit

import urllib2

link = "http://www.techcrunch.com/"
src = urllib2.urlopen(link).read()

f = open('out.txt', 'w')
f.write(src)
f.close()

gives this output.

Answer 1

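The problem is that the HTML you are parsing contains embedded JavaScript (the BeautifulSoup error complains about line 130, which is in the middle of that embedded JavaScript), and the JavaScript in turn contains embedded HTML.

Line 130, note the <a> tag: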

adNode += "<a href='http://t.aol.com?ncid=...

It is a Matryoshka doll of HTML and JavaScript, and Python's built-in parser cannot handle it.

You can follow the instructions for installing a parser that BeautifulSoup itself gives in the error message you posted:

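The Python built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.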
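
Once one of those parsers is installed, its name can be passed as the second argument to BeautifulSoup. The following is only a minimal sketch of that idea, not code from the question: it assumes Python 3 and the beautifulsoup4 package (the question uses Python 2's urllib2), and pick_parser and extract_links are hypothetical helper names introduced here for illustration.

```python
import importlib.util


def pick_parser():
    """Return the name of the first parser backend that is importable.

    Hypothetical helper: prefers the external parsers named in the error
    message (lxml, then html5lib) and falls back to the built-in one.
    """
    for name in ("lxml", "html5lib"):
        # find_spec returns None when the top-level module is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def extract_links(html):
    """Return the href value of every <a> tag that has one.

    Hypothetical helper; assumes the beautifulsoup4 package is installed.
    """
    # Imported here so that pick_parser() stays usable without bs4
    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids relying on whichever one
    # Beautiful Soup would choose by default
    soup = BeautifulSoup(html, pick_parser())
    return [a["href"] for a in soup.find_all("a", href=True)]
```

With lxml or html5lib installed, extract_links(src) would parse the downloaded page with that parser rather than the built-in one. Whether a given page then parses cleanly still depends on the parser and on the page itself.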
