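I need help. I want to build a scraper for a specific website (underminejournal). I want to pull data from the site and print it to the console, because I work mostly in the console and don't want to keep switching. I also want to push the data into a database (SQL etc. is no problem). But when I try to run the spider, all I get is the output below, and I didn't find the tutorials very helpful: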
2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines:
2016-10-05 10:55:24 [scrapy] INFO: Spider opened
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished)
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944),
'log_count/DEBUG': 2,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)}
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished)
My spider looks like this:
# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        # Use the second-to-last URL segment as part of the file name.
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        # Save the raw response body to disk.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Does anyone have a hint?
Edit: result
2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up
Answer 0 (score: 0)
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442
Your URLs should always start with http:// or https://.
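As a rough illustration of that fix, here is a minimal sketch of a guard that prepends a scheme when one is missing. The helper name `ensure_scheme` and its `default_scheme` parameter are made up for this example and are not part of Scrapy. It also assumes Python 3 (`urllib.parse`); the logs above show Python 2.7, where the same function lives in the `urlparse` module.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has an http(s) scheme, else prepend one.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    if urlparse(url).scheme in ("http", "https"):
        return url
    return "%s://%s" % (default_scheme, url)


# The URL from the question, which has no scheme.
start_urls = [ensure_scheme("theunderminejournal.com/#eu/eredar/item/124442")]
print(start_urls[0])

# The part after '#' is a URL fragment; show it separately.
print(urlparse(start_urls[0]).fragment)
```

In the spider itself it is simpler to write the full URL, scheme included, directly in `start_urls`. Note also that the part after `#` is a URL fragment, which is not sent to the server, so the response may contain only the base page rather than the item data shown in a browser.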