I am using Scrapy to extract data from a website. When I open the JSON result file, it is always empty. Here is my Scrapy code:
from scrapy import Spider
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["youtube.com"]
    start_urls = ["https://www.youtube.com/results?search_query=Motorcycle+Accident+Stunt+Rider+Knocks+Himself+Out+Stunt+Fail+2015"]
    def parse(self,response):
        questions = Selector(response).xpath('//a')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a/text()').extract()
            item['url'] = question.xpath('//@href]').extract()
            yield item
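Answer 0 (score: 0)
I assume you are trying to scrape the text and the href attribute of each <a> node. You only need to change the XPath expressions to get results.
Try the following code: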
item['title'] = question.xpath('./text()').extract()
item['url'] = question.xpath('./@href').extract()
Here is some output from trying these in the scrapy shell:
In [38]: questions = Selector(response).xpath('//a')
In [39]: for question in questions:
   ....:     print question.xpath('./text()').extract()
[u'Motorcycle Accident Crash During Wheelie on the Highway Crash 2015']
[u'STREETFIGHTERZ']
[]
[u'Motorcycle Crash Compilation 2015 || Ep.#15 of October']
[u'Car Crash Weekly']
[]
[u'Motorcycle Accident Burnout On Highway Crash 2015']
[u'STREETFIGHTERZ']
[]
[u'Streetfighterz Ride The Murder Biz Ride 2015 Insane Motorcycle Stunts']
[u'STREETFIGHTERZ']
In [40]: for question in questions:
   ....:     print question.xpath('./@href').extract()
[u'/results?filters=movie&lclk=movie&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=show&lclk=show&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=short&lclk=short&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=long&lclk=long&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=4k&lclk=4k&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=hd&lclk=hd&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=cc&lclk=cc&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=creativecommons&lclk=creativecommons&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=3d&lclk=3d&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=live&lclk=live&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=purchased&lclk=purchased&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?filters=spherical&lclk=spherical&search_query=motorcycle+accident+stunt+rider+knocks+himself+out+stunt+fail+2015']
[u'/results?search_sort=video_date_uploaded&search_query=Motorcycle+Accident+Stunt+Rider+Knocks+Himself+Out+Stunt+Fail+2015']
You are already inside the <a> node, so use ./ to select the elements relative to it.