Scrapy error: Error reading file '': failed to load external entity ""

Time: 2018-06-28 10:46:31

Tags: python web-scraping scrapy lxml

I am writing a Scrapy scraper. It works fine for some websites, but for others I get the error

    Error reading file '': failed to load external entity ""
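This message is libxml2's own I/O error: it is raised when lxml is asked to download a document by itself and cannot. A minimal sketch that reproduces it outside Scrapy (the URL is only a placeholder; any https:// address should behave the same, since libxml2's built-in downloader typically has no HTTPS support):

import lxml.html

try:
    # libxml2, not Scrapy, tries to fetch this URL itself; lacking HTTPS
    # support, it raises IOError "failed to load external entity".
    lxml.html.parse("https://example.com/")
except IOError as err:
    print(err)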

Here is the code I wrote for my scraper; don't blame me, I'm still a beginner in Python:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from bs4 import BeautifulSoup
import lxml
from lxml.html.clean import Cleaner
#from scrapy.exporters import XmlItemExporter
import re

# Configure lxml's Cleaner: drop scripts and styles, unwrap most markup
# tags, and remove <header>/<footer> subtrees entirely.
cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.remove_tags = ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'figure', 'small', 'blockquote', 'sub', 'em', 'hr', '!--..--', 'span', 'aside', 'a', 'svg', 'ul', 'li', 'img', 'source', 'nav', 'article', 'section', 'label', 'br', 'noscript', 'body', 'time', 'b', 'i', 'sup', 'strong', 'div']
cleaner.kill_tags = ['header', 'footer']

class MySpider(CrawlSpider):
    name = 'eship5'
    allowed_domains = [
    'ineratec.de',
    ]

    start_urls = [
    'http://ineratec.de/',
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] # Follow any link scrapy finds (that is allowed).


    def parse_item(self, response):
        # Turn the URL into a usable file name.
        page = response.url.replace("/"," ").replace(":"," ")
        filename = '%s.txt' %page
        body = response.url
        # body is a URL string, so lxml.html.parse() makes libxml2 download
        # the page itself instead of reusing the response Scrapy fetched.
        clean_text = lxml.html.tostring(cleaner.clean_html(lxml.html.parse(body)))
        #clean_text = re.sub( '\s+', ' ', str(clean_text, "utf-8").replace('<div>', '').replace('</div>', '')).strip()
        with open(filename, 'w') as f:
            f.write(clean_text)

When I run the code with Scrapy, the error only occurs on some websites. Does it have something to do with the "" and ""? Thanks for any help.

EDIT1: this is the whole error:

    Error reading file '': failed to load external entity ""
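If the failing step is the lxml.html.parse(body) call above, a minimal sketch of an alternative parse_item (an assumption, not the poster's fix: it reuses the cleaner defined in the question and parses the HTML Scrapy has already downloaded, so libxml2 never fetches response.url a second time):

    def parse_item(self, response):
        page = response.url.replace("/", " ").replace(":", " ")
        filename = '%s.txt' % page
        # Parse the body Scrapy already fetched; no second download happens.
        root = lxml.html.fromstring(response.text)
        # encoding='unicode' makes tostring() return str rather than bytes,
        # so it can be written to a file opened in text mode.
        clean_text = lxml.html.tostring(cleaner.clean_html(root), encoding='unicode')
        with open(filename, 'w') as f:
            f.write(clean_text)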

0 Answers:

No answers