我正在为BeautifulSoup提供一个html文档,只需用完整的html构建一个BeautifulSoup对象实例,它似乎扼杀了嵌入在html中的jQuery脚本的以下行:
var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";
错误的完整堆栈跟踪如下:
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
1497 kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1498 kwargs['isHTML'] = True
-> 1499 BeautifulStoneSoup.__init__(self, *args, **kwargs)
1500
1501 SELF_CLOSING_TAGS = buildTagMap(None,
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
1228 self.markupMassage = markupMassage
1229 try:
-> 1230 self._feed(isHTML=isHTML)
1231 except StopParsing:
1232 pass
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
1261 self.builder.reset()
1262
-> 1263 self.builder.feed(markup)
1264 # Close out any unfinished strings and close all the open tags.
1265 self.endData()
/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
106 """
107 self.rawdata = self.rawdata + data
--> 108 self.goahead(0)
109
110 def close(self):
/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
146 if startswith('<', i):
147 if starttagopen.match(rawdata, i): # < + letter
--> 148 k = self.parse_starttag(i)
149 elif startswith("</", i):
150 k = self.parse_endtag(i)
/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
227 def parse_starttag(self, i):
228 self.__starttag_text = None
--> 229 endpos = self.check_for_whole_start_tag(i)
230 if endpos < 0:
231 return endpos
/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
302 return -1
303 self.updatepos(i, j)
--> 304 self.error("malformed start tag")
305 raise AssertionError("we should not get here!")
306
/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
113
114 def error(self, message):
--> 115 raise HTMLParseError(message, self.getpos())
116
117 __starttag_text = None
HTMLParseError: malformed start tag, at line 193, column 110
从我可以收集的内容中,它与尖括号在引号内有关,似乎被它抛弃了。那里有什么样的工作,还是有另一个库可以更好地处理这些边缘情况?或者,有没有办法告诉它忽略所有的javascript内容?
答案 0 :(得分:1)
最简单的方法可能是删除所有脚本。请参阅文档中的删除元素部分:http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements