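Hello, and thank you for taking the time to help.
Problem: I am trying to get my spider to follow the link to the next page and keep crawling, but it isn't working, so I would appreciate some pointers on what I am doing wrong.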
import scrapy


class infoSpider(scrapy.Spider):
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    def parse(self, response):
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
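Update: it returns some results but does not continue on to the next page. I found that each "next" button is given a random key, so I need to find a way around that.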
If you need more information, please let me know (and please don't downvote!).
Thanks in advance.
Answer 0 (score: 2)
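You will want to use a Link Extractor. You can specify the next-page links with a set of rules. Note that rules are only processed by CrawlSpider, not by the plain scrapy.Spider. Here is the official documentation: https://doc.scrapy.org/en/latest/topics/link-extractors.html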
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class infoSpider(CrawlSpider):  # rules are only applied by CrawlSpider
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    rules = (
        # restrict_css selects the elements that contain the links,
        # so it must not end in ::attr(href)
        Rule(LinkExtractor(allow=(), restrict_css='.yt-uix-button-content a'), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }
Answer 1 (score: 0)
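Change
NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'
to
NEXT_PAGE_SELECTOR = '.yt-uix-button-content a::attr(href)'
(with no space before ::attr, so that the href is read from the a element itself rather than from its descendants).
Alternatively, change the last lines of the code to: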
try:
    next_page = response.css('.yt-uix-button-content a::attr(href)').extract()[0]
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )
except IndexError:
    pass
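
Both variants in this answer do the same two things: guard against a page that has no next link, then resolve the href against the current page URL. The following is a minimal sketch of that logic in plain Python, without Scrapy. It rests on some assumptions: `hrefs` stands in for the list returned by `.extract()`, the helper name `next_page_url` and the `&page=2` link are hypothetical, and `urllib.parse.urljoin` is used as a rough stand-in for `response.urljoin` on a page without a `<base>` tag.

```python
from urllib.parse import urljoin


def next_page_url(page_url, hrefs):
    """Return the absolute URL of the next page, or None if there is none.

    `hrefs` stands in for what `.extract()` returns: a possibly empty list.
    This helper is hypothetical and only illustrates the control flow.
    """
    try:
        href = hrefs[0]  # like .extract()[0]: raises IndexError when empty
    except IndexError:
        return None  # same outcome as extract_first() returning None
    # Resolve a relative href against the URL of the page it was found on.
    return urljoin(page_url, href)


base = "https://www.youtube.com/results?search_query=cars"
# The page=2 link below is made up for illustration.
print(next_page_url(base, ["/results?search_query=cars&page=2"]))
print(next_page_url(base, []))
```

When the list is empty the helper returns None, which corresponds to the spider simply not yielding another request.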
Answer 2 (score: 0)
I suggest you use rules (these require subclassing CrawlSpider):
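rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths='//*[contains(@class, "yt-uix-button-content")]/a'),
        # The callback is the name of a spider method, without "self.",
        # and should not be parse, which CrawlSpider uses itself.
        callback='parse_page',
        # follow defaults to False once a callback is set
        follow=True),
)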
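One more suggestion: do not override the parse method. CrawlSpider uses parse internally to apply its rules, so overriding it stops the rules from working.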
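
To illustrate the advice about not overriding parse, here is a toy model in plain Python. It is not Scrapy's actual code: `MiniCrawlSpider`, `GoodSpider`, `BadSpider`, and the sample links are all hypothetical. It only mimics the general idea that a rule-driven base class applies its rules inside parse and looks callbacks up by method name.

```python
class MiniCrawlSpider:
    """Toy model of a rule-driven spider (hypothetical, not Scrapy code)."""

    # Each rule is (predicate selecting links, name of the callback method).
    rules = ()

    def parse(self, page):
        # The base class needs parse() itself: the rules are applied here.
        items = []
        for wants_link, callback_name in self.rules:
            callback = getattr(self, callback_name)  # looked up by name
            for link in page["links"]:
                if wants_link(link):
                    items.append(callback(link))
        return items


class GoodSpider(MiniCrawlSpider):
    rules = ((lambda link: "page=" in link, "parse_page"),)

    def parse_page(self, link):
        return {"visited": link}


class BadSpider(GoodSpider):
    def parse(self, page):
        # Overriding parse() replaces the rule handling entirely.
        return []


page = {"links": ["/results?page=2", "/about"]}
print(GoodSpider().parse(page))  # rules run and parse_page is called
print(BadSpider().parse(page))   # rules never run
```

In this model the subclass that overrides parse never reaches its rules, which matches the symptom of a crawl that stops after the first page.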