>>> soup = BeautifulSoup( data )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 5518, column 822
>>> for each in l[5515:5520]:
...     print each
...
<script>
registerImage("original_image", "http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg","<a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" ><img onload="+'"'+"if (typeof uet == 'function') { uet('af'); }"+'"'+" src="+'"'+"http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg"+'"'+" id="+'"'+"prodImage"+'"'+" width="+'"'+"240"+'"'+" height="+'"'+"240"+'"'+" border="+'"'+"0"+'"'+" alt="+'"'+"Life, on the Line: A Chef's Story of Chasing Greatness, Facing Death, and Redefining the Way We Eat"+'"'+" onmouseover="+'"'+""+'"'+" /></a>", "<br /><a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_text_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" >See larger image</a>", "");
var ivStrings = new Object();
</script>
>>>
>>> l[5518-1][822]
'h'
>>>
Note: this is Python 2.6.5 on Ubuntu 10.04.
Isn't BeautifulSoup supposed to ignore script tags?
I haven't been able to find a workaround :(
Any suggestions?
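Answer 0 (score: 2)
Pyparsing has some HTML tag support that makes for more robust scripts than plain regular expressions. And because it doesn't try to parse or process the whole HTML body, but only looks for matching string expressions, it can cope with malformed HTML: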
html = """<script>
registerImage("original_image",
"this is a closing </script> tag in quotes"
etc....
</script>
"""
# code to strip <script> tags from an HTML page
from pyparsing import makeHTMLTags,SkipTo,quotedString
script,scriptEnd = makeHTMLTags("script")
scriptBody = script + SkipTo(scriptEnd, ignore=quotedString) + scriptEnd
descriptedHtml = scriptBody.suppress().transformString(html)
Depending on the kind of HTML scraping you are trying to do, you may be able to do all of it with pyparsing.
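Answer 1 (score: 0)
I run into trouble with script tags in BeautifulSoup fairly often. When I do, I convert the soup object back to a string, strip out the offending data, and then re-soup the result. This works when you don't care about the data you are removing.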
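A minimal sketch of that strip-then-reparse idea, using only the standard `re` module. The helper name `strip_scripts` and the sample markup are made up for illustration. The pattern is deliberately naive: it assumes no literal `</script>` appears inside a quoted JavaScript string, which is the case the pyparsing answer handles with `ignore=quotedString`.

```python
import re

# Naive pattern: from an opening <script ...> tag to the first closing </script>.
# Assumption: no literal "</script>" occurs inside a quoted JavaScript string.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)

def strip_scripts(markup):
    """Return markup with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", markup)

# Hypothetical sample resembling the markup that trips up the parser.
data = """<html><body><p>before</p><script>
registerImage("img", "<a href="+'"'+"http://example.com/"+'"'+" >x</a>");
</script><p>after</p></body></html>"""

cleaned = strip_scripts(data)
print(cleaned)
# The cleaned string can then be handed to the parser, e.g. BeautifulSoup(cleaned).
```

Doing the removal before the first parse avoids the exception entirely, at the cost of losing everything inside the script blocks.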