I'm writing spiders with Scrapy to pull some data from a couple of applications that use ASP. Both web pages are almost identical and require a login before scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever for something after logging in with the FormRequest method.
The code of both spiders (they are almost identical, just with different IPs) follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate a user login on http://xxx.xxx.xxx.xxx/reporting/
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # The spider never gets here on one of the sites
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
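The hang should also be reproducible outside the spider by replaying the same login POST from the Scrapy shell (a minimal sketch, assuming the login form is the first form on the page, which from_response picks by default):

$ scrapy shell http://xxx.xxx.xxx.xxx/reporting/
>>> from scrapy.http import FormRequest
>>> # Build the same login POST the spider sends and fetch it;
>>> # against the broken host this call should block until the download timeout.
>>> req = FormRequest.from_response(response,
...                                 formdata={'user': 'the_username',
...                                           'password': 'my_nice_password'})
>>> fetch(req)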
Wondering what could be different between them, I used Firefox's Live HTTP Headers add-on to inspect the headers and found only one difference: the page that works is served by IIS 6.0, while the one that doesn't is served by IIS 5.1.
Since that alone couldn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:
Scrapy interacting with the working web page (IIS 6.0):
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 302 Object moved
scrapy --> webpage GET /reporting/htm/webpage.asp
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/asp/report1.asp
...Scraping begins
Scrapy interacting with the web page that doesn't work (IIS 5.1):
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
I googled a bit and found that IIS 5.1 does indeed have a nice little "feature": it returns HTTP 100 whenever someone sends it a POST, as shown here.
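The behavior can be confirmed outside Scrapy with a bare POST over a raw socket (a sketch, assuming direct access to the host; the interim 100 line should show up before the real 302):

import socket

# Send the login POST by hand and dump the raw reply.
# Against the IIS 5.1 host the first status line is expected to be
# "HTTP/1.1 100 Continue", even though no "Expect: 100-continue"
# header was sent.
body = 'user=the_username&password=my_nice_password'
s = socket.create_connection(('xxx.xxx.xxx.xxx', 80))
s.sendall('POST /reporting/ HTTP/1.1\r\n'
          'Host: xxx.xxx.xxx.xxx\r\n'
          'Content-Type: application/x-www-form-urlencoded\r\n'
          'Content-Length: %d\r\n'
          'Connection: close\r\n'
          '\r\n%s' % (len(body), body))
print s.recv(4096)
s.close()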
Knowing where the root of all evil always lies, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?
Thanks!
EDIT - Console log from the site that doesn't work:
2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
Answer 0 (score: 1):
Try using the HTTP 1.0 download handler:
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
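Why this can help: Scrapy's default HTTP 1.1 download handler apparently stalls when the server sends the interim "HTTP/1.1 100 Continue" line before the real "302 Object moved" response, which is exactly what your Wireshark capture shows for the IIS 5.1 host. The HTTP 1.0 handler is built on Twisted's older web client, which tolerates that server quirk, at the cost of HTTP/1.1 features such as persistent connections. Also note that DOWNLOAD_HANDLERS is project-wide, so with this setting every request in the project goes through the HTTP 1.0 handler, not just the ones aimed at the IIS 5.1 host.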