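I have a problem with my Scrapy spider: it reports "unsupported URL scheme". I want to scrape a page of search results, but my spider keeps failing because of this long dynamic URL.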
class RadioSpider(CrawlSpider):
    name = 'radio'
    allowed_domains = ['dashitradio.de']
    start_urls = ["[http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1][1]"]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = RadioItem()
        i['title'] = hxs.select("//*[@id='playlist-results']/table//tr[1]/td[1]/text()").extract()
        i['interpret'] = hxs.select("//*[@id='playlist-results']/table[1]//tr/td[2]/text()").extract()
        i['date'] = hxs.select("//*[@id='playlist-results']/table//tr[1]/td[3]/text()").extract()
        return i
If I run it in the Scrapy shell console, it only works when I put quotes around the URL, like "URL".
How can I get Scrapy to accept this string as a single URL in my spider?
Answer 0 (score: 0)
Your start_urls setting is incorrect: the [ at the beginning and the ][1] at the end make the URL invalid.
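To illustrate why the original value is rejected, here is a minimal standard-library sketch. The helper names `strip_markdown_link` and `has_http_scheme` are hypothetical and not part of Scrapy, the URL is a shortened form of the one in the question, and restricting the check to http/https is an assumption made for this example. The leading [ prevents the parser from recognising any scheme at all.

```python
import re
from urllib.parse import urlsplit


def strip_markdown_link(raw):
    """Remove a Markdown reference-link wrapper of the form [url][n], if present.

    Hypothetical helper for illustration; returns the input unchanged
    (apart from surrounding whitespace) when no wrapper is found.
    """
    raw = raw.strip()
    match = re.fullmatch(r"\[(.+)\]\[\d+\]", raw)
    return match.group(1) if match else raw


def has_http_scheme(url):
    """Return True if the URL parses with an http or https scheme (assumed whitelist)."""
    return urlsplit(url).scheme in ("http", "https")


# Shortened version of the URL from the question, with the stray link markup.
broken = "[http://www.dashitradio.de/nc/search-in-playlist.html][1]"
fixed = strip_markdown_link(broken)

print(has_http_scheme(broken))
print(fixed, has_http_scheme(fixed))
```

In practice the simplest fix is to delete the extra characters from start_urls by hand; the helper only shows what has to be removed.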
I have updated the spider code based on your comments:
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class RadioItem(Item):
    title = Field()
    interpret = Field()
    date = Field()


class RadioSpider(BaseSpider):
    name = 'radio'
    allowed_domains = ['dashitradio.de']
    start_urls = ["http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//div[@id='playlist-results']/table/tbody/tr")
        for row in rows:
            item = RadioItem()
            item['title'] = row.select(".//td[1]/text()").extract()[0]
            item['interpret'] = row.select(".//td[2]/text()").extract()[0]
            item['date'] = row.select(".//td[3]/text()").extract()[0]
            yield item
Save it to my_spider.py and run it via runspider:
scrapy runspider my_spider.py -o output.json
You will then find the following in output.json:
{"date": "2013-06-21 17:48:00", "interpret": "MUMFORD & SONS", "title": "I WILL WAIT"}
{"date": "2013-06-21 17:44:00", "interpret": "TASMIN ARCHER", "title": "SLEEPING SATELLITE"}
{"date": "2013-06-21 17:40:03", "interpret": "ROBIN THICKE", "title": "BLURRED LINES (feat. T.I. & PHARRELL)"}
{"date": "2013-06-21 17:35:02", "interpret": "TINA TURNER", "title": "TWO PEOPLE"}
{"date": "2013-06-21 17:31:02", "interpret": "BON JOVI", "title": "WHAT ABOUT NOW"}
{"date": "2013-06-21 17:28:03", "interpret": "ROXETTE", "title": "SHE'S GOT NOTHING ON (BUT THE RADIO)"}
{"date": "2013-06-21 17:18:01", "interpret": "GNARLS BARKLEY", "title": "CRAZY"}
{"date": "2013-06-21 17:08:01", "interpret": "FLO RIDA", "title": "WHISTLE"}
{"date": "2013-06-21 17:05:03", "interpret": "WHAM", "title": "WAKE ME UP BEFORE YOU GO GO"}
{"date": "2013-06-21 17:00:03", "interpret": "P!NK FEAT. NATE RUESS", "title": "JUST GIVE ME A REASON"}
{"date": "2013-06-21 16:48:01", "interpret": "SHAKIRA", "title": "WHENEVER, WHEREVER"}
{"date": "2013-06-21 16:44:00", "interpret": "ALPHAVILLE", "title": "BIG IN JAPAN"}
{"date": "2013-06-21 16:40:01", "interpret": "XAVIER NAIDOO", "title": "BEI MEINER SEELE"}
{"date": "2013-06-21 16:36:02", "interpret": "SANTANA", "title": "SMOOTH"}
{"date": "2013-06-21 16:32:01", "interpret": "OLLY MURS", "title": "ARMY OF TWO"}
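If you want to process the results afterwards, a minimal sketch follows. It assumes the file holds one JSON object per line, as displayed above; depending on the Scrapy version and export format the file may instead be a single JSON array, in which case a plain `json.load` is enough. The function name `parse_json_lines` is hypothetical, and the sample reuses two records from the output above.

```python
import json


def parse_json_lines(text):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two records copied from the output above, used as sample input.
sample = (
    '{"date": "2013-06-21 17:48:00", "interpret": "MUMFORD & SONS", "title": "I WILL WAIT"}\n'
    '{"date": "2013-06-21 17:44:00", "interpret": "TASMIN ARCHER", "title": "SLEEPING SATELLITE"}\n'
)

records = parse_json_lines(sample)
for record in records:
    print(record["date"], record["interpret"], "-", record["title"])
```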
Hope that helps.