I am scraping data from http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2 (just this one page, to test my scraper).
items.py
import scrapy

class ShipItem(scrapy.Item):
    name = scrapy.Field()
    imo = scrapy.Field()
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

class CategoryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
settings.py
BOT_NAME = 'ship'
SPIDER_MODULES = ['ship.spiders']
NEWSPIDER_MODULE = 'ship.spiders'
DOWNLOAD_DELAY = 0.5
spiders/shipspider.py
import scrapy
from ship.items import ShipItem

class ShipSpider(scrapy.Spider):
    name = "shipspider"
    allowed_domains = ["shipspotting.com"]
    page_url = "http://www.shipspotting.com"
    start_urls = [
        page_url + "/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2"
    ]

    def parse(self, response):
        ships = response.xpath('/html/body/center/table/tbody/tr/td[1]/table[1]/tbody/tr/td[2]/div[3]/center/table/tbody/tr/td/table[4]/tbody/tr')
        for ship in ships:
            item = ShipItem()
            item['name'] = ship.xpath('td/center/table[1]/tbody/tr/td[2]/span').extract()[0]
            yield item
spiders/categoryspider.py
import scrapy
from ship.items import CategoryItem

class CategorySpider(scrapy.Spider):
    name = "catspider"
    allowed_domains = ["shipspotting.com"]
    page_url = "http://www.shipspotting.com"
    start_urls = [
        page_url + "/gallery/categories.php"
    ]

    def parse(self, response):
        cats = response.xpath('//td[@class="whiteboxstroke"]/a')
        file = open('categories.txt', 'a')
        for cat in cats:
            item = CategoryItem()
            item['name'] = cat.xpath('img/@title').extract()[0]
            item['link'] = cat.xpath('@href').extract()[0]
            yield item
        file.close()
catspider runs perfectly. However, shipspider does not work; it scrapes nothing and only produces this output:
2015-06-24 20:15:16+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ship)
2015-06-24 20:15:16+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-24 20:15:16+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ship.spiders', 'SPIDER_MODULES': ['ship.spiders'], 'DOWNLOAD_DELAY': 0.5, 'BOT_NAME': 'ship'}
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled item pipelines:
2015-06-24 20:15:16+0800 [shipspider] INFO: Spider opened
2015-06-24 20:15:16+0800 [shipspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-24 20:15:16+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-24 20:15:16+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-24 20:15:19+0800 [shipspider] DEBUG: Crawled (200) <GET http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2> (referer: None)
2015-06-24 20:15:19+0800 [shipspider] INFO: Closing spider (finished)
2015-06-24 20:15:19+0800 [shipspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 318,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 477508,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 24, 12, 15, 19, 620358),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 6, 24, 12, 15, 16, 319378)}
2015-06-24 20:15:19+0800 [shipspider] INFO: Spider closed (finished)
I wondered whether my XPath is wrong, but when I try the same expressions in Chrome they match the elements just fine.
So, is there some subtle problem with my shipspider?
Answer 0 (score: 4)
Browsers insert tbody elements into tables when they build the DOM. That is why your XPath works in the dev tools but fails in Scrapy, which sees only the raw HTML the server sent, with no tbody in it. This is a common gotcha.
In general, write your XPath expressions by hand instead of trusting auto-generated ones; they are usually long and brittle. For example, to select the rows containing ship data you can use an XPath like this:
//tr[td[@class='whiteboxstroke']]
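To see the tbody mismatch concretely, here is a minimal sketch using only the standard library's xml.etree.ElementTree (Scrapy's response.xpath behaves the same way on this point); the HTML snippet and the ship name are made up for illustration:

```python
import xml.etree.ElementTree as ET

# What the server actually sends: a table with no tbody element.
served_html = (
    "<table>"
    "<tr><td class='whiteboxstroke'>MV Example</td></tr>"
    "</table>"
)
table = ET.fromstring(served_html)

# A Chrome-copied path assumes a tbody the browser added itself,
# so it matches nothing against the raw markup:
print(table.findall("./tbody/tr"))  # []

# Dropping the tbody step matches the row:
print([td.text for tr in table.findall("./tr") for td in tr])  # ['MV Example']
```

The same asymmetry explains the spider's empty result: every `/tbody/` step in the copied path breaks the match against the served HTML.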
To test your XPath you should use the scrapy shell, e.g.:
> scrapy shell "http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2"
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fbf52c122d0>
[s] item {}
[s] request <GET http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2>
[s] response <200 http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2>
[s] settings <scrapy.settings.Settings object at 0x7fbf54f5cf90>
[s] spider <DefaultSpider 'default' at 0x7fbf51f6a1d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: x = "/html/body/center/table/tbody/tr/td[1]/table[1]/tbody/tr/td[2]/div[3]/center/table/tbody/tr/td/table[4]/tbody/tr"
In [2]: response.xpath(x)
Out[2]: []
In [4]: response.xpath("//tr[td[@class='whiteboxstroke']]")
Out[4]:
[<Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
<Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
<Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
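If you do want to start from a Chrome-copied path, a practical workaround is to strip the browser-inserted tbody steps before handing the expression to Scrapy. The helper below is a sketch (the name strip_tbody is mine, not part of Scrapy), applied to the question's original path:

```python
import re

def strip_tbody(xpath):
    """Remove the /tbody steps that browsers insert when normalising tables."""
    return re.sub(r"/tbody(?=/|$)", "", xpath)

chrome_xpath = ("/html/body/center/table/tbody/tr/td[1]/table[1]/tbody/tr"
                "/td[2]/div[3]/center/table/tbody/tr/td/table[4]/tbody/tr")
print(strip_tbody(chrome_xpath))
# /html/body/center/table/tr/td[1]/table[1]/tr/td[2]/div[3]/center/table/tr/td/table[4]/tr
```

This simple version assumes bare tbody steps with no predicates, which is what Chrome's "Copy XPath" produces; a short predicate-based expression like the one above is still the more robust choice.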