Web scraping with Python 3 - ignoring a duplicate attribute error

Date: 2015-08-09 23:48:52

Tags: python python-3.x xml-parsing web-scraping html-parsing

I want to build a web-scraping application with Python 3. The site I am trying to scrape serves invalid XHTML: it contains tags with duplicate attribute names.

I want to use xml.dom.minidom to parse the fetched page. Because of the duplicate attribute names, the content fails to parse and I get the following error:

Traceback (most recent call last):
  File "scraper.py", line 45, in <module>
    scraper.list()
  File "scraper.py", line 34, in list
    dom = parseString(response.text)
  File "C:\Python34\lib\xml\dom\minidom.py", line 1970, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: duplicate attribute: line 2, column 43

I would like to ignore this error and parse the document anyway. I have no control over the incoming HTML data. What can I do?

Here is my code:

import requests
from xml.dom.minidom import parseString


class Scraper:

    def list(self, pages=1):
        response = requests.get('http://example.com')

        # Raises xml.parsers.expat.ExpatError because of the duplicate attributes
        dom = parseString(response.text)

        print(dom.toxml())


if __name__ == "__main__":
    scraper = Scraper()
    scraper.list()
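The failure can be reproduced without any network call. A minimal sketch, where the duplicated `class` attribute is a made-up stand-in for whatever the real page contains:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical markup with a duplicated attribute, standing in for the real page
bad_xhtml = '<html><body><div class="a" class="b">hi</div></body></html>'

try:
    dom = parseString(bad_xhtml)
    parsed = True
except ExpatError as err:
    # expat is a strict XML parser: duplicate attributes are a fatal error
    parsed = False
    print("minidom refused the document:", err)
```

Because minidom is backed by expat, a strict XML parser, there is no switch to make it tolerate this; a lenient HTML parser is needed instead.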

1 Answer:

Answer 0 (score: 1)

There is a better way: switch to the BeautifulSoup HTML parser. It is very good at parsing malformed or broken HTML and, depending on the underlying parser library, can be more or less lenient:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://example.com').content
soup = BeautifulSoup(response, "html.parser")  # or use "html5lib", or "lxml"
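For example, BeautifulSoup parses markup with a duplicated attribute instead of raising. A small sketch (the sample markup and the duplicated `id` are assumptions; exactly which of the two values is kept depends on the BeautifulSoup version and its duplicate-attribute handling):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a duplicated "id" attribute
bad_xhtml = '<html><body><div id="a" id="b">hello</div></body></html>'

soup = BeautifulSoup(bad_xhtml, "html.parser")
div = soup.find("div")
print(div.get_text())  # the text is recovered despite the broken attribute
print(div.get("id"))   # one of the two attribute values is kept
```

The rest of the scraping code can then use the usual `soup.find()` / `soup.find_all()` navigation instead of the minidom DOM API.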