I've made a spider to crawl a forum that requires a login. I start it on the login page. The problem lies with the page I then direct the spider to after a successful login.

If I open up my rules to accept all links, the spider successfully follows the links on the login page. However, it doesn't follow any of the links on the page I feed it via Request(). This suggests the problem isn't me messing up the XPaths.

The login appears to work - the page_parse function writes the page source to a text file, and that source is from the page I'm after, which can only be reached once logged in. However, the pipeline I have that takes a screenshot of each page captures the login page, but not the page I subsequently send the spider to.

Here is the spider:
import logging

from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from plm.items import PLMItem  # assuming the default Scrapy project layout


class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]
    start_urls = [
        "https://www.patientslikeme.com/login"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']")), callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']")), callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
            formdata={'userlogin[login]': username, 'userlogin[password]': password},
            callback=self.after_login)]

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url="https://www.patientslikeme.com/forum/diabetes2/topics",
                           callback=self.page_parse)
And here is my debug log:
2014-03-21 18:22:05+0000 [scrapy] INFO: Scrapy 0.18.2 started (bot: plm)
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'plm.spiders', 'ITEM_PIPELINES': {'plm.pipelines.ScreenshotPipeline': 1}, 'DEPTH_LIMIT': 5, 'SPIDER_MODULES': ['plm.spiders'], 'BOT_NAME': 'plm', 'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled item pipelines: ScreenshotPipeline
2014-03-21 18:22:06+0000 [plm] INFO: Spider opened
2014-03-21 18:22:06+0000 [plm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-21 18:22:07+0000 [scrapy] INFO: Screenshooter initiated
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: None)
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:08+0000 [scrapy] INFO: Login attempted
2014-03-21 18:22:08+0000 [plm] DEBUG: Filtered duplicate request: <GET https://www.patientslikeme.com/login> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-03-21 18:22:09+0000 [plm] DEBUG: Redirecting (302) to <GET https://www.patientslikeme.com/profile/activity/all> from <POST https://www.patientslikeme.com/login>
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/profile/activity/all> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:10+0000 [scrapy] INFO: Post login
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/forum/diabetes2/topics> (referer: https://www.patientslikeme.com/profile/activity/all)
2014-03-21 18:22:10+0000 [scrapy] INFO: Page parse attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:10+0000 [scrapy] INFO: Screenshot attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:15+0000 [plm] DEBUG: Scraped from <200 https://www.patientslikeme.com/forum/diabetes2/topics>
{'url': 'https://www.patientslikeme.com/forum/diabetes2/topics'}
2014-03-21 18:22:15+0000 [plm] INFO: Closing spider (finished)
2014-03-21 18:22:15+0000 [plm] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2068,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 53246,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 3, 21, 18, 22, 15, 177000),
'item_scraped_count': 1,
'log_count/DEBUG': 13,
'log_count/INFO': 8,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2014, 3, 21, 18, 22, 6, 377000)}
2014-03-21 18:22:15+0000 [plm] INFO: Spider closed (finished)
Thanks for any help you can give.

---- EDIT ----

I've tried to implement Paul t.'s suggestion. Unfortunately, I'm getting the following error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 93, in start
if self.start_crawling():
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 168, in start_crawling
return self.start_crawler() is not None
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 158, in start_crawler
crawler.start()
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1213, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
result = g.send(result)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 74, in start
yield self.schedule(spider, batches)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 61, in schedule
requests.extend(batch)
exceptions.TypeError: 'Request' object is not iterable
Since the traceback doesn't point at any particular part of the spider, I'm having trouble working out where to fix the problem.

---- EDIT 2 ----

The problem was caused by the start_requests function provided by Paul t., which used return rather than yield. Once I changed it to yield, it works perfectly.
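For reference, a minimal sketch of the difference (using the login_url attribute and login_parse callback from the answer below):

def start_requests(self):
    # Broken: hands Scrapy a single Request object; the scheduler then tries
    # to iterate over it and raises "'Request' object is not iterable".
    return Request(self.login_url, callback=self.login_parse)

def start_requests(self):
    # Working: yield turns the method into a generator, i.e. an iterable
    # of Requests, which is what Scrapy expects from start_requests().
    yield Request(self.login_url, callback=self.login_parse)

Returning a list, e.g. return [Request(self.login_url, callback=self.login_parse)], would also work, since a list is iterable.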
Answer 0 (score: 4):

My suggestion is to trick the CrawlSpider: do the login through a "pseudo" start URL, and once logged in, issue requests for the real start_urls so that the crawl proceeds exactly as if it had started from start_urls in the usual way.

Here's an illustration of this:
class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]

    # pseudo-start_url
    login_url = "https://www.patientslikeme.com/login"

    # start URLs used after login
    start_urls = [
        "https://www.patientslikeme.com/forum/diabetes2/topics",
    ]

    rules = (
        # you want to do the login only once, so it's probably cleaner
        # not to ask the CrawlSpider to follow links to the login page
        #Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),

        # you can also deny "/login" to be safe
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']"),
                               deny=('/login',)),
             callback='post_parse', follow=False),

        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']"),
                               deny=('/login',)),
             callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    # by default start_urls pages will be sent to the parse method,
    # but parse() is rather special in CrawlSpider
    # so I suggest you create your own initial login request "manually"
    # and ask for it to be parsed by your specific callback
    def start_requests(self):
        yield Request(self.login_url, callback=self.login_parse)

    # you've got the login page, send credentials
    # (so far so good...)
    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
            formdata={'userlogin[login]': username, 'userlogin[password]': password},
            callback=self.after_login)]

    # so we got a response to the login thing
    # if we're good,
    # just do as if we were starting the crawl now,
    # basically doing what happens when you use start_urls
    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return [Request(url=u) for u in self.start_urls]
            # alternatively, you could even call CrawlSpider's start_requests() method directly
            # that's probably cleaner
            #return super(PLMSpider, self).start_requests()

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    # if you want the start_urls pages to be parsed,
    # you need to tell CrawlSpider to do so by defining parse_start_url attribute
    # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L38
    parse_start_url = page_parse
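As the commented-out line in after_login hints, a possibly cleaner variant (a sketch, not part of the answer's tested code) is to hand control back to the stock start_requests() once the login has succeeded:

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # The default Spider.start_requests() builds a Request for each URL in
        # start_urls with the standard parse() callback, so CrawlSpider's rules
        # and parse_start_url apply to those responses. Returning that iterable
        # from a callback is fine.
        return super(PLMSpider, self).start_requests()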
Answer 1 (score: 0):

Your login page is parsed by the parse_start_url method. You should redefine that method so that it handles parsing the login page. Take a look at the documentation.
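In other words, a minimal sketch of what this answer suggests (reusing the form fields and after_login callback from the question's spider, and keeping the login page in start_urls):

    # With start_urls = ["https://www.patientslikeme.com/login"], CrawlSpider
    # routes the response for that start URL through parse_start_url(), so
    # the login form can be submitted from here instead of via a Rule.
    def parse_start_url(self, response):
        log.msg("Login attempted")
        return FormRequest.from_response(
            response,
            formdata={'userlogin[login]': username,
                      'userlogin[password]': password},
            callback=self.after_login)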