Scrapy is ignoring the content of the second page

Date: 2017-09-18 20:21:09

Tags: python python-3.x web-scraping scrapy scrapy-spider

I have written a small scraper in Python Scrapy to parse names from a webpage. The content spans 4 pages through pagination. The total number of names across all pages is 46, but it is only scraping 36 of them.

The scraper would normally skip the content of the first landing page, but I have handled that with the parse_start_url assignment in my scraper.

However, the problem I am facing now is that it unexpectedly skips the content of the second page and parses everything else: the first page, the third page, the fourth page, and so on. Why is this happening, and how can I deal with it? Thanks in advance.

Here is the script I am trying:

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield scrapy.Request(url=response.urljoin(new_link), callback=self.target_page)

    def target_page(self, response):
        parse_start_url = self.target_page  # I used this argument to capture the content of the first page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}
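
For reference, parse_start_url is a hook on CrawlSpider that is meant to be overridden; the assignment above only creates an unused local variable, so it has no effect on a plain scrapy.Spider. A minimal sketch of its intended use, with a hypothetical URL and the same selectors as above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PagedSpider(CrawlSpider):
    name = "paged"
    start_urls = ["https://example.com/listing"]  # hypothetical URL
    # follow every link inside the pagination block and parse it
    rules = (Rule(LinkExtractor(restrict_css='.pagination'), callback='parse_item'),)

    def parse_start_url(self, response):
        # CrawlSpider calls this with the response for each start URL,
        # so the landing page itself also yields items
        return self.parse_item(response)

    def parse_item(self, response):
        for name in response.css('.title a::text').extract():
            yield {'Name': name}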

2 Answers:

Answer 0 (Score: 1)

The solution was very simple. I have fixed it:

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for f_link in self.start_urls:
        yield response.follow(url=f_link, callback=self.target_page)  # this is the line that fixes the issue

        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield response.follow(url=new_link, callback=self.target_page)

    def target_page(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

Now it gives me all the results. The extra response.follow requests the first page (the start URL) again with target_page as the callback, so the landing page's names are captured as well; the pagination links then cover the remaining pages.

Answer 1 (Score: 0)

That is because the link you specified in start_urls is actually the link to the second page (the page parameter is zero-based, so page=1 is the second page). If you open it, you will see that there is no <a> tag for the current page in the pagination. That is why page 2 never reaches target_page, and why you should point start_urls to:

https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191

This code should help you:

import scrapy
from scrapy.http import Request


class DataokspiderSpider(scrapy.Spider):
    name = 'dataoksp'
    allowed_domains = ['data.ok.gov']
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191",]

    def parse(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield Request("https://data.ok.gov{}".format(next_page), callback=self.parse)
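
The same pagination step also works without hardcoding the domain: response.follow (available since Scrapy 1.4) resolves a relative href against the current page's URL, so the last lines of parse could instead read:

        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            # response.follow resolves the relative href against response.url,
            # so the host is not hardcoded into the spider
            yield response.follow(next_page, callback=self.parse)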

Stats (see item_scraped_count):

{
    'downloader/request_bytes': 2094,
    'downloader/request_count': 6,
    'downloader/request_method_count/GET': 6,
    'downloader/response_bytes': 45666,
    'downloader/response_count': 6,
    'downloader/response_status_count/200': 6,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 9, 19, 7, 23, 47, 801934),
    'item_scraped_count': 46,
    'log_count/DEBUG': 53,
    'log_count/INFO': 7,
    'memusage/max': 47509504,
    'memusage/startup': 47509504,
    'request_depth_max': 4,
    'response_received_count': 6,
    'scheduler/dequeued': 5,
    'scheduler/dequeued/memory': 5,
    'scheduler/enqueued': 5,
    'scheduler/enqueued/memory': 5,
    'start_time': datetime.datetime(2017, 9, 19, 7, 23, 46, 59360)
}