On Python 3.2.3, running on Kubuntu Linux 12.10 with Requests 0.12.1 and BeautifulSoup 4.1.0, I am parsing some web pages:
try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False
pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))
soup = bs4.BeautifulSoup(response.content)
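Note that hundreds of other web pages parse without problems. What is it about this particular page that crashes Python, and how can I work around it? Here is the crash: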
- bruno:scraper$ ./test-broken-site.py
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
  File "./test-broken-site.py", line 146, in <module>
    main(sys.argv)
  File "./test-broken-site.py", line 138, in main
    has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
  File "./test-broken-site.py", line 67, in test_page_parse
    soup = bs4.BeautifulSoup(response.content)
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
    self.parser.close()
  File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
  File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
  File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
  File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
  File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
    doctype = Doctype.for_name_and_ids(name, pubid, system)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
    return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found
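Instead of bs4.BeautifulSoup(response.content) I also tried bs4.BeautifulSoup(response.text). That gives the same result (the same crash on this page). What can I do to work around broken pages like this one so that I can still parse them?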
Answer 0 (score: 1)
The page shown in your output contains this doctype:
<!DOCTYPE>
A proper page should declare a complete doctype instead, for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
When the BeautifulSoup parser tries to build the doctype here:
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
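the Doctype value is empty (None, as the TypeError at the end of your traceback shows), and the parser fails as soon as it tries to use that value.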
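To see the failure in isolation, here is a minimal sketch that mirrors the call shape of the last traceback frame, str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING). The helper name build_doctype_value and the "utf-8" default are hypothetical stand-ins for illustration, not bs4's own code:

```python
def build_doctype_value(value, encoding="utf-8"):
    """Hypothetical helper: decode `value` the way the failing frame does.

    Passing an encoding makes str expect a bytes-like object, so None
    (which is what an empty <!DOCTYPE> appears to yield) is rejected.
    """
    return str.__new__(str, value, encoding)


if __name__ == "__main__":
    # A bytes value decodes normally.
    print(build_doctype_value(b"html"))

    # None triggers the same kind of TypeError seen in the traceback.
    try:
        build_doctype_value(None)
    except TypeError as error:
        print("TypeError:", error)
```

The exact wording of the TypeError message differs between Python versions, but the exception type is the same.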
One solution is to repair the doctype manually with a regex before passing the page to BeautifulSoup.
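A rough sketch of that regex repair might look like the following. The function name fix_empty_doctype, the pattern, and the choice of "<!DOCTYPE html>" as the replacement are my own assumptions; any non-empty doctype should do. It works on bytes because response.content is bytes:

```python
import re

# Matches a doctype declaration that has no name, such as b"<!DOCTYPE>"
# or b"<!doctype  >". A complete doctype is left untouched because the
# pattern requires ">" right after optional whitespace.
EMPTY_DOCTYPE = re.compile(br"<!DOCTYPE\s*>", re.IGNORECASE)


def fix_empty_doctype(markup, replacement=b"<!DOCTYPE html>"):
    """Return `markup` (bytes) with the first empty doctype replaced."""
    return EMPTY_DOCTYPE.sub(replacement, markup, count=1)


if __name__ == "__main__":
    broken = b"<!DOCTYPE>\n<html><body><p>hi</p></body></html>"
    print(fix_empty_doctype(broken))
```

In your script you would then call bs4.BeautifulSoup(fix_empty_doctype(response.content)) instead of passing response.content directly. If you parse response.text instead, the pattern and the replacement need to be str rather than bytes.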