I have a project using Scrapy 1.0.3. Everything was running well, and then, without any major changes, the spiders started taking at least 30 minutes to complete. Here are some logs from the prod environment:
2015-11-13 12:00:50 INFO Log opened.
2015-11-13 12:00:50 INFO [scrapy.log] Scrapy 1.0.3.post6+g2d688cd started
2015-11-13 12:39:26 INFO [scrapy.utils.log] Scrapy 1.0.3.post6+g2d688cd started (bot: fancy)
2015-11-13 12:39:26 INFO [scrapy.utils.log] Optional features available: ssl, http11, boto
As you can see from the logs, it takes roughly 40 minutes just to start up.
I get the same problem in my console if I run scrapy bench, scrapy list or scrapy check.
Does anyone have any ideas? I have checked this in both our development and production environments and hit the same problem. I thought it might be something in my code, but I'm confused as to what that could be when it also affects the basic scrapy commands. Normal Python scripts run without any problem.
This is the traceback when I cancel the run:
^CTraceback (most recent call last):
  File "/home/nitrous/code/trendomine/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 11, in <module>
    class FancyUpdateSpider(scrapy.Spider):
  File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 28, in FancyUpdateSpider
    pg_r = requests.get(url, headers=headers)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 179, in recv
    data = self.connection.recv(*args, **kwargs)
  File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1319, in recv
    result = _lib.SSL_read(self._ssl, buf, bufsiz)
KeyboardInterrupt
Thanks.
Answer 0 (score: 0)
The problem was that I was making one huge GET request at the top of the script to build a JSON response that I then used as start_urls. To fix it, I wrapped that logic in def start_requests(self), and instead of pulling one huge JSON for all the requests, I now yield requests after each paged JSON GET request.
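For context, the traceback above ends in a requests.get call inside the class body of fancy_update_spider.py, which executes as soon as the module is imported, and every scrapy command (list, check, bench, crawl) imports the spider modules before doing anything else. A minimal sketch of that old pattern (a reconstruction from the traceback, not the original file; the URL and header handling are borrowed from the new code below):

# Sketch of the OLD, problematic pattern (reconstruction, not the original source).
import json
import requests
import scrapy
from scrapy.conf import settings

class FancyUpdateSpider(scrapy.Spider):
    name = 'fancy_update'
    allowed_domains = ['foo.com']

    # Class-body code runs at import time, so even `scrapy list` blocks here
    # until every paginated GET below has finished.
    url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=foo'
    headers = {'X-Api-Key': settings['API_KEY'], 'Content-Type': 'application/json'}
    r = requests.get(url, headers=headers)
    start_urls = [x["link"] for x in json.loads(r.content)]
    # ...followed by one more requests.get per extra page, which is the call
    # visible at line 28 of the traceback:
    #   pg_r = requests.get(url, headers=headers)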
New code:
import scrapy
from urlparse import urljoin
import re
import json
import requests
import math
from scrapy.conf import settings
from fancy.items import FancyItem
def roundup(x):
    return int(math.ceil(x / 10.0)) * 10


class FancyUpdateSpider(scrapy.Spider):
    name = 'fancy_update'
    allowed_domains = ['foo.com']

    def start_requests(self):
        # Get URLS
        url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=foo'
        headers = {'X-Api-Key': settings['API_KEY'], 'Content-Type': 'application/json'}
        r = requests.get(url, headers=headers)
        # Get initial data
        start_urls_data = json.loads(r.content)
        # Grab the total number of products and round up to nearest 10
        count = roundup(int(r.headers['count']))
        pages = (count / 10) + 1
        for x in start_urls_data:
            yield scrapy.Request(x["link"], dont_filter=True)
        for i in range(2, pages):
            pg_url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=Fancy&page={0}'.format(i)
            print pg_url
            pg_r = requests.get(pg_url, headers=headers)
            # Add remaining data to the JSON
            additional_start_urls_data = json.loads(pg_r.content)
            for x in additional_start_urls_data:
                yield scrapy.Request(x["link"], dont_filter=True)

    def parse(self, response):
        item = FancyItem()
        item['link'] = response.url
        item['interest'] = response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract_first()
        return item