I have a scrapy spider that uses file:/// URLs as its start URLs to pick up static HTML files on disk, but I can't get it to load the gzipped files and loop through my directory of 150,000 files with the .html.gz suffix. I have tried several different approaches (which I've commented out), but so far nothing has worked. My code currently looks like this:
from scrapy.spiders import CrawlSpider
from Scrapy_new.items import Scrapy_newTestItem
import gzip
import glob
import os.path

class Scrapy_newSpider(CrawlSpider):
    name = "info_extract"
    source_dir = '/path/to/file/'
    allowed_domains = []
    start_urls = ['file://///path/to/files/.*html.gz']

    def parse_item(self, response):
        item = Scrapy_newTestItem()
        item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
        item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
        item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()
        return item
Running this gives the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
    with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'
Changing my code so the files are decompressed first and then passed in, as follows:
source_dir = 'path/to/files/'
for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    with gzip.open(src_name, 'rb') as infile:
        #start_urls = ['/path/to/files*.html']#
        file_cont = infile.read()
        start_urls = file_cont #['file:////file_cont']
gives the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %3C
Answer 0 (score: 0)
You don't always have to use start_urls in a scrapy spider. Also, CrawlSpider is normally used together with rules that specify which routes to follow and what to extract when crawling large sites, so you probably want to use scrapy.Spider directly instead of CrawlSpider.
Now, the solution relies on the start_requests method that a scrapy spider provides, which handles the spider's first requests. If this method is implemented in your spider, start_urls won't be used:
import gzip
import glob
import os

from scrapy import Spider

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        os.chdir("/path/to/files")
        for file_name in glob.glob("*.html.gz"):
            f = gzip.open(file_name, 'rb')
            file_content = f.read()
            print file_content  # now you are reading the file content of your local files
Now, remember that start_requests must return an iterable of requests, which is not the case here, because you are only reading files (I assume you will later create requests with the content of those files), so my code would fail with something like:
CRITICAL:
Traceback (most recent call last):
  ...
  /.../scrapy/crawler.py", line 73, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
because I'm not returning anything from my start_requests method (None), which is not iterable.
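To flesh that out, here is a minimal sketch of one way to turn those files into actual requests. This part is my own assumption rather than the original answer: it decompresses each archive to a temporary .html file and yields a file:// request for it, with parse_item as a hypothetical callback and /path/to/files as a placeholder path.

import gzip
import glob
import tempfile

from scrapy import Spider, Request

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        for gz_path in glob.glob('/path/to/files/*.html.gz'):
            # decompress the archive in memory
            with gzip.open(gz_path, 'rb') as infile:
                html = infile.read()
            # write the plain HTML to a temporary file so the standard
            # file:// download handler can serve it to the spider
            tmp = tempfile.NamedTemporaryFile(suffix='.html', delete=False)
            tmp.write(html)
            tmp.close()
            yield Request('file://' + tmp.name,
                          callback=self.parse_item,
                          meta={'source_file': gz_path})

    def parse_item(self, response):
        # extract fields here with response.xpath(...), as in the question
        pass

With this, Scrapy fetches each temporary file through its normal file:// handler, so response.xpath() works exactly as it would for a remote page.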
Answer 1 (score: 0)
Scrapy will not be able to deal with the compressed HTML files; you have to extract them first. This can be done on the fly in Python, or you can simply extract them at the operating-system level.
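As an illustration of the second option, a small one-off script along these lines (my own sketch, with placeholder paths) would pre-extract the whole directory before the spider runs, the Python equivalent of running gunzip over it at the OS level:

import glob
import gzip
import shutil

# decompress every .html.gz next to the original, keeping the .gz files
for gz_path in glob.glob('/path/to/files/*.html.gz'):
    html_path = gz_path[:-3]  # strip the trailing ".gz"
    with gzip.open(gz_path, 'rb') as infile, open(html_path, 'wb') as outfile:
        shutil.copyfileobj(infile, outfile)

After that, the spider can point its start URLs (or start_requests) at the plain .html files directly.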