BeautifulSoup在jQuery脚本上窒息,任何已知的解决方法?

时间:2010-11-14 17:53:53

标签: html xml screen-scraping html-parsing beautifulsoup

我正在为BeautifulSoup提供一个html文档,只需用完整的html构建一个BeautifulSoup对象实例,它似乎扼杀了嵌入在html中的jQuery脚本的以下行:

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

错误的完整堆栈跟踪如下:

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
   1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
   1498         kwargs['isHTML'] = True
-> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
   1500 
   1501     SELF_CLOSING_TAGS = buildTagMap(None,

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
   1228         self.markupMassage = markupMassage
   1229         try:
-> 1230             self._feed(isHTML=isHTML)
   1231         except StopParsing:
   1232             pass

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
   1261         self.builder.reset()
   1262 
-> 1263         self.builder.feed(markup)
   1264         # Close out any unfinished strings and close all the open tags.

   1265         self.endData()

/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
    106         """
    107         self.rawdata = self.rawdata + data
--> 108         self.goahead(0)
    109 
    110     def close(self):

/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
    146             if startswith('<', i):
    147                 if starttagopen.match(rawdata, i): # < + letter
--> 148                     k = self.parse_starttag(i)
    149                 elif startswith("</", i):
    150                     k = self.parse_endtag(i)

/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
    227     def parse_starttag(self, i):
    228         self.__starttag_text = None
--> 229         endpos = self.check_for_whole_start_tag(i)
    230         if endpos < 0:
    231             return endpos

/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
    302                 return -1
    303             self.updatepos(i, j)
--> 304             self.error("malformed start tag")
    305         raise AssertionError("we should not get here!")
    306 

/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
    113 
    114     def error(self, message):
--> 115         raise HTMLParseError(message, self.getpos())
    116 
    117     __starttag_text = None

HTMLParseError: malformed start tag, at line 193, column 110

从我可以收集的内容中,它与尖括号在引号内有关,似乎被它抛弃了。那里有什么样的工作,还是有另一个库可以更好地处理这些边缘情况?或者,有没有办法告诉它忽略所有的javascript内容?

1 个答案:

答案 0 :(得分:1)

最简单的方法可能是删除所有脚本。请参阅文档中的删除元素部分:http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements