Disclaimer: The site I am crawling is a corporate intranet, so I have modified the URLs to protect its privacy.
I managed to log in to the site, but I have not managed to crawl it.
Starting from the start_url
https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this site leads you to a similar page with a more complicated URL,
i.e.
https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {UNID = ADE682E34FC59D274825770B0037D278}),
I would like to crawl, for every webpage under the start_url, all the hrefs found under //li/a.
(Every crawled page exposes a large number of hyperlinks, and they repeat, because both the parent site and the sub-sites can be reached from the same page.)
As you can see, the href is not the actual link (quoted above) that we are taken to when we crawl the page; it also has a # in front of its useful content. Could that be the source of the problem?
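One quick way to inspect what those hrefs actually resolve to is the snippet below (Python 2, matching the environment in the log; run it inside a callback, since scrapy shell will not be logged in):

import urlparse  # Python 2 stdlib

for href in response.xpath('//li/a/@href').extract():
    # An href that begins with '#' resolves back to the page it was
    # found on (plus a fragment), not to a new document.
    print urlparse.urljoin(response.url, href)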
For restrict_xpaths, I restricted the xpath of the "Log Out" link.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy

class kmssSpider(CrawlSpider):
    name = 'kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domains = ["kmssqkr.sarg"]  # must be "allowed_domains" (plural); a misspelled "allowed_domain" is ignored by the OffsiteMiddleware

    rules = (
        Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*',),
                           restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                           unique=True),
             callback='parse_item', follow=True),
    )
    # r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
    # restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'), unique=True)

    def start_requests(self):
        # Go to the login page first; the real crawl starts only after login succeeds.
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        # Fill in and submit the login form found on the login page.
        return FormRequest.from_response(response,
                                         formdata={'user': 'user', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            # Kick off the crawl from the start URL now that we are authenticated.
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
Log:
2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines:
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG:
Successfuly Logged in
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 31259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)
[1]: http://i.stack.imgur.com/REQXJ.png
---------------------------------- REVISION ----------------------------------
I have seen the cookie format in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site, but I am not sure exactly what I should add, and how to add it to the Request.
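For reference, the documented way is to pass a dict through the cookies argument of Request; the cookie names and values below are placeholders I invented, not the site's real cookies:

from scrapy.http import Request

# Hypothetical cookie names/values, for illustration only.
request_with_cookies = Request(
    url='https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',
    cookies={'SessionID': 'AAA', 'LtpaToken': 'BBB'},
)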
Answer 0: (score: 3)
First of all, do not beg; sometimes that makes me mad and then I do not answer the question.
Enable cookie debugging with COOKIES_DEBUG = True and look at which cookies get sent with each Request. You will then notice that no cookies are sent, even though Scrapy's middleware should be sending them. I think this is because you construct the Request yourself, and Scrapy will not be smarter than you: it accepts your solution and sends this request without cookies.
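That is a one-line change in settings.py (COOKIES_DEBUG is a standard Scrapy setting):

# settings.py -- log every Cookie header sent and every Set-Cookie header received,
# so you can verify whether the authenticated session reaches the crawl requests.
COOKIES_DEBUG = True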
This means you need to access the cookies in the response and add the required ones (or all of them) to the Request you yield.
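A minimal sketch of that idea, assuming the session cookies arrive as Set-Cookie headers on the login response (the header parsing here is deliberately naive, for illustration only):

def check_login_response(self, response):
    if 'Welcome' in response.body:
        # Pull "name=value" pairs out of the Set-Cookie headers and re-send
        # them explicitly on the request that starts the crawl.
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            name, _, rest = header.partition('=')
            cookies[name.strip()] = rest.split(';', 1)[0]
        yield Request(url=self.start_url[0], cookies=cookies, dont_filter=True)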