I'm trying to scrape data from a website, which requires me to follow 2 links before scraping the data.
The goal is to get an export file that looks like this:
My code is as follows:
I have been executing the spider in the terminal with the following command:
import scrapy
from scrapy.item import Item, Field
from scrapy import Request


class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()


class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        # Extracts first link
        items = []
        list1 = response.css("").extract()  # extract all info from here
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request
        yield items
    def parseInfo1(self, response):
        # Extracts second link
        item = myItems()
        items = response.meta['item']
        list1 = response.css("").extract()
        for i in list1:
            link1 = '' + str(i)
            request = Request(link1, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
        items.append(item)
        return request
    def parseInfo2(self, response):
        # Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items
The data I'm getting isn't right and has gaps like this:
For example, it scrapes the first set of data several times, and the rest of the data is out of order.
If someone could point me in the right direction to get the results in a cleaner format, as shown at the beginning, it would be greatly appreciated.
Thanks
Answer 0 (score: 1)
I solved it by combining the following of the two links into one function instead of two. My spider now works as follows:
class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        # Extracts both sets of links in one callback
        items = []
        list1 = response.css("").extract()
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parse, dont_filter=True)
            request.meta['item'] = items
            yield request
        list2 = response.css("").extract()
        for i in list2:
            link2 = '' + str(i)
            request = Request(link2, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            yield request
        yield items

    def parseInfo2(self, response):
        # Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items
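For comparison, the pattern shown in the official Scrapy tutorials for this kind of two-hop crawl is to give each hop its own callback and to build a fresh item only in the final callback, so the feed exporter (for example, scrapy crawl techbot -o output.csv) writes exactly one row per detail page and nothing is shared between requests. The sketch below is only an assumption of what that would look like here: the start URL and all CSS selectors are placeholders, and only the item field names are taken from the code above.

import scrapy


class TechItem(scrapy.Item):
    info1 = scrapy.Field()
    info2 = scrapy.Field()
    info3 = scrapy.Field()
    info4 = scrapy.Field()


class ChainedSpider(scrapy.Spider):
    name = 'techbot_chained'
    # Placeholder listing page; replace with the real start URL.
    start_urls = ['https://example.com/listing']

    def parse(self, response):
        # First hop: follow every first-level link on the listing page.
        # 'a.first-level::attr(href)' is a made-up selector.
        for href in response.css('a.first-level::attr(href)').extract():
            yield response.follow(href, callback=self.parse_intermediate)

    def parse_intermediate(self, response):
        # Second hop: follow every second-level link to a detail page.
        for href in response.css('a.second-level::attr(href)').extract():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Final hop: build and yield one fresh item per detail page, so the
        # exported file gets one complete row per page with no duplicates.
        item = TechItem()
        item['info1'] = response.css('.info1::text').extract()
        item['info2'] = response.css('.info2::text').extract()
        item['info3'] = response.css('.info3::text').extract()
        item['info4'] = response.css('.info4::text').extract()
        yield item

If some fields have to come from the intermediate page rather than the detail page, a partially filled item can be passed along via Request.meta (or cb_kwargs in Scrapy 1.7+) and completed in the last callback; the key point is that each request chain carries its own item instead of every callback appending to one shared list.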