Scraping a large number of static html.gz files with Scrapy

Date: 2017-03-13 18:55:50

Tags: python python-2.7 web-scraping scrapy gzip

I have a Scrapy spider that uses file:/// URLs as its start URLs to pick up static HTML files on disk, but I can't get it to load the gzipped files and loop through my directory of 150,000 files with the .html.gz suffix. I've tried several different approaches, which I've commented out, but so far nothing has worked. My code currently looks like this:

    from scrapy.spiders import CrawlSpider
    from Scrapy_new.items import Scrapy_newTestItem
    import gzip
    import glob
    import os.path

    class Scrapy_newSpider(CrawlSpider):
        name = "info_extract"
        source_dir = '/path/to/file/'
        allowed_domains = []
        start_urls = ['file://///path/to/files/.*html.gz']

        def parse_item(self, response):
            item = Scrapy_newTestItem()
            item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
            item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
            item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()

Running this gives the following error:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
        result = f(*args, **kw)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
        with open(filepath, 'rb') as fo:
    IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'

Changing my code so that the files are unzipped first and then passed in, as follows:

    source_dir = 'path/to/files/'
    for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
        base = os.path.basename(src_name)
        with gzip.open(src_name, 'rb') as infile:
            #start_urls = ['/path/to/files*.html']#
            file_cont = infile.read()
            start_urls = file_cont#['file:////file_cont']

gives the following error:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
        request = next(slot.start_requests)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
        yield self.make_requests_from_url(url)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
        return Request(url, dont_filter=True)
      File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
        self._set_url(url)
      File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    ValueError: Missing scheme in request url: %3C

2 Answers:

Answer 0 (score: 0)

You don't always have to use start_urls in a Scrapy spider. Also, CrawlSpider is normally used together with rules that specify which routes to follow and what to extract when crawling a large site, so you probably want to use scrapy.Spider directly instead of CrawlSpider.

Now, the solution relies on the start_requests method that Scrapy spiders provide, which handles the spider's first requests. If this method is implemented in your spider, start_urls will not be used:

    from scrapy import Spider

    import gzip
    import glob
    import os

    class ExampleSpider(Spider):
        name = 'info_extract'

        def start_requests(self):
            os.chdir("/path/to/files")
            for file_name in glob.glob("*.html.gz"):
                f = gzip.open(file_name, 'rb')
                file_content = f.read()
                print file_content # now you are reading the file content of your local files

Now, keep in mind that start_requests must return an iterable of requests, which is not the case here, since you are only reading the files (I assume you will create requests with the content of those files later on), so my code will fail with something like:

    CRITICAL:
    Traceback (most recent call last):
      ...
    /.../scrapy/crawler.py", line 73, in crawl
        start_requests = iter(self.spider.start_requests())
    TypeError: 'NoneType' object is not iterable

That is because I am not returning anything from my start_requests method (i.e. None), which is not iterable.
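As a rough sketch of what the follow-up could look like (the temp-file approach, spider name, and paths are my assumptions, not part of the original answer), one option is to decompress each .html.gz into a temporary .html file inside start_requests and yield a file:// request for it, so the usual parse callback still receives a normal response:

    import glob
    import gzip
    import os
    import tempfile

    from scrapy import Spider, Request

    class GzHtmlSpider(Spider):
        # Hypothetical spider name and directory; adjust to your own setup.
        name = 'info_extract'
        source_dir = '/path/to/files'

        def start_requests(self):
            tmp_dir = tempfile.mkdtemp()
            for gz_path in glob.glob(os.path.join(self.source_dir, '*.html.gz')):
                # Decompress into a temporary .html file so the stock
                # file:// download handler can serve it unchanged.
                html_path = os.path.join(tmp_dir, os.path.basename(gz_path)[:-3])
                with gzip.open(gz_path, 'rb') as infile, open(html_path, 'wb') as outfile:
                    outfile.write(infile.read())
                yield Request('file://' + html_path, callback=self.parse_item)

        def parse_item(self, response):
            # The XPath extraction from the question would go here.
            pass

This keeps the answer's start_requests idea while satisfying the requirement that the method yield Request objects.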

Answer 1 (score: 0)

Scrapy will not be able to handle compressed HTML files, so you have to extract them first. This can be done on the fly in Python, or you can simply extract them at the operating-system level.
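If you go the extract-first route, here is a minimal sketch (the directory path is a placeholder) that unpacks every .html.gz next to the original file; the spider can then point its file:// start URLs at the resulting .html files:

    import glob
    import gzip
    import os
    import shutil

    source_dir = '/path/to/files'  # placeholder for the directory with the .html.gz files

    for gz_path in glob.glob(os.path.join(source_dir, '*.html.gz')):
        html_path = gz_path[:-3]  # 'page.html.gz' -> 'page.html'
        with gzip.open(gz_path, 'rb') as infile, open(html_path, 'wb') as outfile:
            # Stream the decompressed bytes so large files are not held in memory.
            shutil.copyfileobj(infile, outfile)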

Related: Python Scrapy on offline (local) data