I'm trying to follow each link in a list and scrape data from every linked page with a simple Scrapy spider, but I'm running into trouble.
When I recreate the script step by step in the Scrapy shell, it sends the GET request for the new URL, but when I run the actual crawl I get no data back from the linked pages. The only data I receive comes from the start URL, which is scraped before the spider follows the links.
How can I scrape data from the linked pages?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "players"
    start_urls = ['http://wiki.teamliquid.net/counterstrike/Portal:Teams']

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {
                'teamName': team.css('a::text').extract_first()
            }

        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_team_info)

    def parse_team_info(self, response):
        yield {
            'Test': response.css('span::text').extract_first()
        }
Answer (score: 2)
Instead of

    url = response.urljoin(url)
    yield scrapy.Request(url, callback=self.parse_team_info)

use

    yield response.follow(url, callback=self.parse_team_info)

The problem is that urls is a SelectorList, so each url in your loop is a Selector, not a string. Passing a Selector to response.urljoin() raises a TypeError; Scrapy logs the error and moves on, so the follow-up requests are never scheduled and parse_team_info never runs. response.follow() accepts Selectors (and relative URLs) directly and extracts the href value for you.
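
For context, here is a minimal sketch of the parse method with that fix applied. It keeps your original selectors and the older extract_first() API (the modern equivalents are .get()/.getall()):

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {'teamName': team.css('a::text').extract_first()}

        # response.follow() accepts attribute Selectors directly and
        # resolves the relative hrefs against the current page's URL.
        for url in teams.css('span.team-template-text a::attr(href)'):
            yield response.follow(url, callback=self.parse_team_info)

If you prefer to keep scrapy.Request, extracting the string first also works: response.urljoin(url.extract()).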