Question

I can't get Scrapy to follow the "next page" link. According to the log, it requests the page itself rather than the "next page" URL. Here is the code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://search.jeffersondeeds.com/pdetail.php?instnum=2016230701&year=2016&db=0&cnum=20',
    ]

    def parse(self, response):
        for quote in response.xpath('//div'):
            yield {
                'record': quote.select(".//span/text()").extract()
            }

        next_page = response.xpath('//*[@id="nextpage"]/a/@href').extract()

        if next_page is not None:
            print("GOOOO BUCKS!!")
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        else:
            print("Ahhh fooey!")

The XPath looks correct, but the URL captured as next_page is the original URL (start_urls).

Answer 1

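next_page is not None; it is an empty list. extract() always returns a list, so the "is not None" check passes even when nothing matched, and joining an empty value onto the response URL gives back the page's own URL.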
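The effect can be reproduced without Scrapy. This is a minimal sketch using only the standard library, on the assumption that response.urljoin ends up calling urllib.parse.urljoin with the response URL as the base; the empty list stands in for what extract() returns when the XPath matches nothing.

```python
from urllib.parse import urljoin

BASE_URL = (
    "http://search.jeffersondeeds.com/pdetail.php"
    "?instnum=2016230701&year=2016&db=0&cnum=20"
)

# Stand-in for .extract() on an XPath with no matches: an empty list, not None.
next_page = []

# The original check: an empty list is still "not None", so the branch runs.
passes_none_check = next_page is not None

# urljoin hands back the base URL when the second argument is empty,
# which is why the spider requests the same page again.
joined = urljoin(BASE_URL, next_page)

# A truthiness check treats None, "" and [] alike as "no next page".
should_follow = bool(next_page)
```

Replacing "if next_page is not None:" with "if next_page:" would at least stop the spider from re-requesting the same page, although it does not find the real link on its own.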

The nextpage link is generated by JavaScript, in the script matched by '//table//script/text()'.

You can use: response.xpath('//table//script/text()').re_first("href=\\'(pdetail.*)\\'>")
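To see what that pattern extracts, here is a small sketch that applies the same regular expression with the standard re module instead of Scrapy's re_first. The script text and its instnum value are made up for illustration (the real page's markup may differ), and find_next_page is a hypothetical helper, not part of Scrapy.

```python
import re
from urllib.parse import urljoin

BASE_URL = (
    "http://search.jeffersondeeds.com/pdetail.php"
    "?instnum=2016230701&year=2016&db=0&cnum=20"
)

# Hypothetical inline script text; the link target is invented for the example.
SCRIPT_TEXT = """
document.write("<a href='pdetail.php?instnum=123&year=2016'>Next</a>");
"""

# Same pattern as in the answer, written as a raw string.
NEXT_PAGE_RE = r"href=\'(pdetail.*)\'>"


def find_next_page(script_text, base_url):
    """Return the absolute next-page URL, or None when no link is found."""
    match = re.search(NEXT_PAGE_RE, script_text)
    if match is None:
        return None
    # The captured href is relative, so resolve it against the current page.
    return urljoin(base_url, match.group(1))


next_url = find_next_page(SCRIPT_TEXT, BASE_URL)
```

In the spider, the equivalent would be to take the value returned by re_first, which is None when nothing matches, and only call response.urljoin and yield a new scrapy.Request when that value is truthy.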
