Scrapy takes more than 30 minutes to run any spider or scrapy bench

Time: 2015-11-13 18:18:13

Tags: scrapy

I have a project running Scrapy 1.0.3. Everything had been working fine, and then, without any major changes on our side, spiders started taking at least 30 minutes to complete. Here are some logs from the prod environment:

2015-11-13 12:00:50 INFO    Log opened.
2015-11-13 12:00:50 INFO    [scrapy.log] Scrapy 1.0.3.post6+g2d688cd started
2015-11-13 12:39:26 INFO    [scrapy.utils.log] Scrapy 1.0.3.post6+g2d688cd started (bot: fancy)
2015-11-13 12:39:26 INFO    [scrapy.utils.log] Optional features available: ssl, http11, boto

As you can see from the logs, it takes roughly 40 minutes just to start up.

The same thing happens in my console if I run scrapy bench, scrapy list, or scrapy check.

Does anyone have any ideas?

I have checked this in both our development and production environments and hit the same problem in each.

I think it may be related to the code, but I'm confused about what it could be if it also affects basic scrapy commands.

Plain Python scripts execute without any problem.

This is the traceback when I cancel the run:

^CTraceback (most recent call last):
  File "/home/nitrous/code/trendomine/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
      File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 11, in <module>
    class FancyUpdateSpider(scrapy.Spider):
  File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 28, in FancyUpdateSpider
    pg_r = requests.get(url, headers=headers)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 179, in recv
    data = self.connection.recv(*args, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1319, in recv
    result = _lib.SSL_read(self._ssl, buf, bufsiz)
KeyboardInterrupt

Thanks.

1 Answer:

Answer 0: (score: 0)

The problem was that I was performing one huge GET request at the top of the script to build a JSON file that was used as start_urls. To fix it, I wrapped the logic in def start_requests(self), and instead of building one huge JSON for all the requests up front, I now yield the requests after each paginated JSON GET request.
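For context, here is a minimal sketch of roughly what the old spider looked like, reconstructed from the traceback above (the exact original code isn't shown here): the blocking requests.get call sat in the class body, so it ran at import time, and because every basic scrapy command (scrapy list, scrapy check, scrapy bench) imports all spider modules through walk_modules, each of them had to sit through the whole API download before doing anything.

import json
import requests
import scrapy

class FancyUpdateSpider(scrapy.Spider):

  name = 'fancy_update'
  allowed_domains = ['foo.com']

  # Class-body code runs when the module is imported, i.e. for *every*
  # scrapy command, not just when this spider is actually crawled.
  url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=foo'
  headers = {'Content-Type': 'application/json'}  # API key omitted in this sketch
  pg_r = requests.get(url, headers=headers)  # blocking network call at import time
  start_urls = [x["link"] for x in json.loads(pg_r.content)]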

New code:

import scrapy
from urlparse import urljoin
import re
import json
import requests
import math
from scrapy.conf import settings

from fancy.items import FancyItem

def roundup(x):
  return int(math.ceil(x / 10.0)) * 10

class FancyUpdateSpider(scrapy.Spider):

  name = 'fancy_update'
  allowed_domains = ['foo.com']

  def start_requests(self):
    # Get URLS
    url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=foo'
    headers = {'X-Api-Key': settings['API_KEY'], 'Content-Type': 'application/json'}
    r = requests.get(url, headers=headers)
    # Get initial data
    start_urls_data = json.loads(r.content)
    # Grab the total number of products and round up to nearest 10
    count = roundup(int(r.headers['count']))
    pages = (count / 10) + 1
    for x in start_urls_data:
      yield scrapy.Request(x["link"], dont_filter=True)

    for i in range(2, pages):
      pg_url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=Fancy&page={0}'.format(i)
      print pg_url
      pg_r = requests.get(pg_url, headers=headers)
      # Add remaining data to the JSON
      additional_start_urls_data = json.loads(pg_r.content)
      for x in additional_start_urls_data:
        yield scrapy.Request(x["link"], dont_filter=True)

  def parse(self, response):
    item = FancyItem()
    item['link'] = response.url
    item['interest'] = response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract_first()
    return item
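
A possible variation on the same idea (not part of the original fix, just a sketch under the same assumed API URL, X-Api-Key header, and JSON shape as above): the paginated API calls can themselves be issued as scrapy.Request objects, so start_requests never blocks on synchronous requests.get calls either. This sketch pages until the API returns an empty list instead of reading the count header.

import json
import scrapy
from scrapy.conf import settings

class FancyUpdateApiSpider(scrapy.Spider):

  name = 'fancy_update_api'
  allowed_domains = ['foo.com']
  api_url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=Fancy&page={0}'

  def api_headers(self):
    return {'X-Api-Key': settings['API_KEY'], 'Content-Type': 'application/json'}

  def start_requests(self):
    # Only the first API page is requested here; later pages are scheduled
    # from its callback, so nothing blocks while the crawl starts up.
    yield scrapy.Request(self.api_url.format(1), headers=self.api_headers(),
                         callback=self.parse_api, meta={'page': 1})

  def parse_api(self, response):
    data = json.loads(response.body)
    for x in data:
      yield scrapy.Request(x["link"], dont_filter=True, callback=self.parse)
    if data:  # keep paging until the API returns an empty page
      next_page = response.meta['page'] + 1
      yield scrapy.Request(self.api_url.format(next_page), headers=self.api_headers(),
                           callback=self.parse_api, meta={'page': next_page})

  def parse(self, response):
    # Plain dict instead of FancyItem to keep the sketch self-contained
    return {'link': response.url}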