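I want to build a web scraping application in Python 3. The site I am trying to scrape serves invalid XHTML: it contains tags with duplicate attribute names.
I want to use xml.dom.minidom to parse the fetched pages. Because of the duplicate attribute names, the content fails to parse and I get the following error: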
Traceback (most recent call last):
  File "scraper.py", line 45, in <module>
    scraper.list()
  File "scraper.py", line 34, in list
    dom = parseString(response.text)
  File "C:\Python34\lib\xml\dom\minidom.py", line 1970, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: duplicate attribute: line 2, column 43
I would like to ignore this error and parse the document anyway. I have no control over the incoming HTML. What can I do?
Here is my code:
import requests
from xml.dom.minidom import parse, parseString

class Scraper:
    def __init__(self):
        pass

    def list(self, pages=1):
        response = requests.get('http://example.com')
        # Raises ExpatError when a tag repeats an attribute name
        dom = parseString(response.text)
        print(dom.toxml())  # toxml is a method, so it has to be called

if __name__ == "__main__":
    scraper = Scraper()
    scraper.list()
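For anyone who wants to reproduce the failure without hitting the network, here is a minimal sketch. The `BROKEN` string and the `try_parse` helper are made up for illustration; the real page is assumed to contain something similar, namely a tag that repeats an attribute name.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical markup: the same attribute appears twice on one tag.
BROKEN = '<root><a href="first" href="second">link</a></root>'


def try_parse(markup):
    """Return the parsed DOM, or the ExpatError message if parsing fails."""
    try:
        return parseString(markup)
    except ExpatError as exc:
        return str(exc)


result = try_parse(BROKEN)
print(result)  # an error message that mentions "duplicate attribute"
```

Catching the exception only lets the program continue; expat is a strict XML parser, so it still does not yield a document for this input.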
Answer 0 (score: 1)
There is a better way: switch to the BeautifulSoup HTML parser. It is very good at parsing malformed or broken HTML and, depending on the underlying parser library, can be more or less lenient:
from bs4 import BeautifulSoup
import requests

url = 'http://example.com'  # the page from the question
html = requests.get(url).content  # raw bytes; BeautifulSoup detects the encoding
soup = BeautifulSoup(html, "html.parser")  # or use "html5lib", or "lxml"
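To check this against the specific problem in the question, here is a small sketch that feeds BeautifulSoup a tag with a repeated attribute. The `BROKEN` string is a made-up stand-in for the real page, and the built-in "html.parser" backend is assumed; other backends may resolve the duplicate differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a repeated attribute, standing in for the real page.
BROKEN = '<html><body><a href="first" href="second">link</a></body></html>'

# No exception is raised here, unlike xml.dom.minidom.parseString.
soup = BeautifulSoup(BROKEN, "html.parser")

link = soup.find("a")
link_text = link.get_text()
link_href = link.get("href")  # one of the duplicate values is kept

print(link_text, link_href)
```

Which of the duplicate values survives depends on the parser and the BeautifulSoup version, so it is safer not to rely on it if both values matter.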