I am writing a web scraper. It works fine for some websites, but for others I get this error:
Error reading file "": failed to load external entity ""
Here is the code I wrote for my scraper. Please bear with me, I am still a beginner in Python.
The error only happens on some websites when I run the code with scrapy. Does it have something to do with "" and ""? Thanks for your help.
EDIT1: Here is the whole error:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from bs4 import BeautifulSoup
import lxml
from lxml.html.clean import Cleaner
#from scrapy.exporters import XmlItemExporter
import re
cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.remove_tags = ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'figure', 'small', 'blockquote', 'sub', 'em', 'hr', '!--..--', 'span', 'aside', 'a', 'svg', 'ul', 'li', 'img', 'source', 'nav', 'article', 'section', 'label', 'br', 'noscript', 'body', 'time', 'b', 'i', 'sup', 'strong', 'div']
cleaner.kill_tags = ['header', 'footer']
class MySpider(CrawlSpider):
    name = 'eship5'
    allowed_domains = [
        'ineratec.de',
    ]
    start_urls = [
        'http://ineratec.de/',
    ]
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        page = response.url.replace("/", " ").replace(":", " ")
        filename = '%s.txt' % page
        body = response.url
        clean_text = lxml.html.tostring(cleaner.clean_html(lxml.html.parse(body)))
        #clean_text = re.sub('\s+', ' ', str(clean_text, "utf-8").replace('<div>', '').replace('</div>', '')).strip()
        with open(filename, 'w') as f:
            f.write(clean_text)
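For comparison, here is a minimal sketch (not from the post; the HTML string is made up and stands in for Scrapy's `response.text`) of the difference between `lxml.html.parse`, which takes a filename or URL and asks libxml2 to fetch it itself, and `lxml.html.fromstring`, which parses text that is already in memory. libxml2 cannot fetch `https://` URLs, which is the usual cause of "failed to load external entity" on HTTPS sites:

```python
import lxml.html

# Hypothetical page body, standing in for response.text,
# which Scrapy has already downloaded for you.
html = "<html><body><h1>Title</h1><p>hello</p></body></html>"

# lxml.html.parse(url) would make libxml2 re-fetch the page itself
# (and fail on https). fromstring() parses the in-memory string instead:
tree = lxml.html.fromstring(html)

# tostring() returns bytes, so decode before writing with open(..., 'w')
clean_text = lxml.html.tostring(tree).decode("utf-8")
print(clean_text)
```

Note that `tostring()` returns bytes; the spider above writes it to a file opened in text mode, which also needs either this `.decode()` or `open(filename, 'wb')`.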