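I am using the Scrapy shell on the URL http://www.yelp.com/search?find_desc=&find_loc=60089
I need to get the data and the URLs from that page. For example, I need to scrape the following data from that link.
I used the command
hxs.select('//span[@class="indexed-biz-name"]/a/text()').extract()
to extract that data.
I have tried many ways to get the other data, but none of them worked for that page.
Please send me the code.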
Answer 0 (score: 0)
Your expression works:
paul@wheezy:~$ scrapy shell "http://www.yelp.com/search?find_desc=&find_loc=60089"
2014-01-29 22:48:22+0100 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-01-29 22:48:22+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-01-29 22:48:22+0100 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled item pipelines:
2014-01-29 22:48:22+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-29 22:48:22+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-29 22:48:22+0100 [default] INFO: Spider opened
2014-01-29 22:48:24+0100 [default] DEBUG: Crawled (200) <GET http://www.yelp.com/search?find_desc=&find_loc=60089> (referer: None)
[s] Available Scrapy objects:
[s] item {}
[s] request <GET http://www.yelp.com/search?find_desc=&find_loc=60089>
[s] response <200 http://www.yelp.com/search?find_desc=&find_loc=60089>
[s] sel <Selector xpath=None data=u'<html xmlns:fb="http://www.facebook.com/'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x3ba6b50>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: sel.xpath('//span[@class="indexed-biz-name"]/a/text()').extract()
Out[1]:
[u'Firewood Kabob Mediterranean Grill',
u"Lou Malnati's Pizzeria",
u'Hakuya Sushi',
u'Nails & Spa Studio',
u'Wooil Korean Restaurant',
u"Grande Jake's Fresh Mexican Grill",
u'Hanabi Japanese Restaurant',
u'India House',
u'Deerfields Bakery',
u'Wiener Take All']
In [2]:
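Since the question also asks for the URLs, the same XPath can target the link's href attribute instead of its text. In the shell session above that would presumably be sel.xpath('//span[@class="indexed-biz-name"]/a/@href').extract(), which I have not run against the live page. The sketch below shows the idea with Python's standard library only, so it runs without Scrapy or a network request. The SAMPLE markup, the /biz/example-... paths, and the extract_listings helper are made up for illustration; the real page is larger and its markup may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# URL from the question, used only to resolve relative links.
BASE_URL = "http://www.yelp.com/search?find_desc=&find_loc=60089"

# Hypothetical, simplified markup that mimics the structure the XPath targets.
SAMPLE = """
<div>
  <span class="indexed-biz-name">1. <a href="/biz/example-one">Example One</a></span>
  <span class="indexed-biz-name">2. <a href="/biz/example-two">Example Two</a></span>
</div>
"""


def extract_listings(markup, base_url):
    """Return (name, absolute_url) pairs for each business link in the markup."""
    root = ET.fromstring(markup.strip())
    listings = []
    # Same path as in the question: the <a> inside each matching <span>.
    for link in root.findall(".//span[@class='indexed-biz-name']/a"):
        name = link.text
        # href values are relative here, so resolve them against the page URL.
        url = urljoin(base_url, link.get("href"))
        listings.append((name, url))
    return listings


listings = extract_listings(SAMPLE, BASE_URL)
for name, url in listings:
    print(name, url)
```

Iterating over the link elements, rather than extracting names and hrefs in two separate queries, keeps each name paired with its own URL even if one of the entries is missing an attribute.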