Web scraping with Python 3 - ignoring a duplicate attribute error

Date: 2015-08-09 23:48:52

Tags: python python-3.x xml-parsing web-scraping html-parsing

I want to build a web-scraping application with Python 3. The site I am trying to scrape serves invalid XHTML: it contains tags with duplicate attribute names.

I want to use xml.dom.minidom to parse the fetched page. Because of the duplicate attribute names, the content fails to parse and I get the following error:

Traceback (most recent call last):
  File "scraper.py", line 45, in <module>
    scraper.list()
  File "scraper.py", line 34, in list
    dom = parseString(response.text)
  File "C:\Python34\lib\xml\dom\minidom.py", line 1970, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: duplicate attribute: line 2, column 43

I would like to ignore this error and parse the document anyway. I have no control over the incoming HTML data. What can I do?

Here is my code:

import requests
from xml.dom.minidom import parseString


class Scraper:

    def list(self, pages=1):
        response = requests.get('http://example.com')

        # Raises xml.parsers.expat.ExpatError because of the duplicate attributes
        dom = parseString(response.text)

        print(dom.toxml())


if __name__ == "__main__":
    scraper = Scraper()
    scraper.list()
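The failure can be reproduced without any network call. A minimal sketch, where the duplicated `class` attribute is a made-up stand-in for whatever the real page contains:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical markup with a duplicated attribute, standing in for the real page
bad_xhtml = '<html><body><div class="a" class="b">hi</div></body></html>'

try:
    dom = parseString(bad_xhtml)
    parsed = True
except ExpatError as err:
    # expat is a strict XML parser: duplicate attributes are a fatal error
    parsed = False
    print("minidom refused the document:", err)
```

Because minidom is backed by expat, a strict XML parser, there is no switch to make it tolerate this; a lenient HTML parser is needed instead.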

1 Answer:

Answer 0 (score: 1)

There is a better way: switch to the BeautifulSoup HTML parser. It is very good at parsing malformed or broken HTML and, depending on the underlying parser library, can be more or less lenient:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://example.com').content
soup = BeautifulSoup(response, "html.parser")  # or use "html5lib", or "lxml"
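For example, BeautifulSoup parses markup with a duplicated attribute instead of raising. A small sketch (the sample markup and the duplicated `id` are assumptions; exactly which of the two values is kept depends on the BeautifulSoup version and its duplicate-attribute handling):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a duplicated "id" attribute
bad_xhtml = '<html><body><div id="a" id="b">hello</div></body></html>'

soup = BeautifulSoup(bad_xhtml, "html.parser")
div = soup.find("div")
print(div.get_text())  # the text is recovered despite the broken attribute
print(div.get("id"))   # one of the two attribute values is kept
```

The rest of the scraping code can then use the usual `soup.find()` / `soup.find_all()` navigation instead of the minidom DOM API.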