Scrapy newbie here. I'm trying to scrape some basic data from a bridge website, and for some reason I keep getting redirected back to localhost. Most other sites don't do this (e.g. the dmoz example from the tutorial), so my hunch is that I haven't set something up to handle this particular domain. My spider (nearly identical to the one in the tutorial, apart from the changed URL):
import scrapy

class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/vugraph/schedule.php"
    ]

    # rules for parsing main response
    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The error I get (the relevant part) is:
2016-01-23 14:21:50 [scrapy] INFO: Scrapy 1.0.4 started (bot: bbo)
2016-01-23 14:21:50 [scrapy] INFO: Optional features available: ssl, http11
2016-01-23 14:21:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bbo.spiders', 'SPIDER_MODULES': ['bbo.spiders'], 'BOT_NAME': 'bbo'}
2016-01-23 14:21:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-23 14:21:50 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-23 14:21:50 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-23 14:21:50 [scrapy] INFO: Enabled item pipelines:
2016-01-23 14:21:50 [scrapy] INFO: Spider opened
2016-01-23 14:21:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-23 14:21:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-23 14:21:54 [scrapy] DEBUG: Redirecting (302) to <GET http://127.0.0.1> from <GET http://www.bridgebase.com/vugraph/schedule.php>
2016-01-23 14:21:54 [scrapy] DEBUG: Retrying <GET http://127.0.0.1> (failed 1 times): Connection was refused by other side: 111: Connection refused.
This is probably a very basic issue, but I'm having trouble even figuring out where to start. Does anyone have any pointers?
Answer 0 (score: 2)
You have to provide a User-Agent header to pretend to be a real browser. You can do this directly in the spider by passing a headers dictionary while yielding a scrapy.Request from start_requests():
import scrapy

class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]

    def start_requests(self):
        yield scrapy.Request("http://www.bridgebase.com/vugraph/schedule.php", headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
        })

    # rules for parsing main response
    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Alternatively, you can set the USER_AGENT project setting.
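For example, a minimal sketch of what that would look like in the project's settings.py (reusing the same example User-Agent string as above); this then applies to every request the spider makes, so start_requests() no longer needs to set the header per request:

```python
# settings.py -- Scrapy project settings
# The UA string below is just an example; any realistic browser UA works.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/47.0.2526.111 Safari/537.36")
```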