How can I scrape multiple pages with Scrapy?

Asked: 2019-08-01 18:43:06

Tags: python scrapy

I want to collect the titles and abstracts of some articles. The site's pages are laid out like this:

Page 1 (list of conferences):
  Conf1, year
  Conf2, year
  ....

Page 2 (list of articles for each Conf):
  Article1, title
  Article2, title
  ....

Page 3 (the page for each Article):
  Title
  Abstract

I want to collect the articles for each conference (along with other information about the conference, such as its year). First of all, I'm not sure whether I need a framework like Scrapy for this or should just write a plain Python program. When I looked into Scrapy, I found I could write a spider like the following to collect the conference information:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # Conference links in the first table on the page
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[1]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Conference links in the second table
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[2]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Follow pagination if a "next" link exists
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

However, I have to click each conference's link to see its articles, and I haven't found many examples showing how to collect the rest of the data I need with Scrapy. Could you guide me on how to crawl the article pages while collecting the data for each conference?

1 Answer:

Answer 0 (score: 2)

You can write code like the following:

import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # A 'table' step without an index matches the rows of every table on the page
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table/tbody/tr/th/a'):
            item = {
                'name': conf.xpath('./text()').extract_first(),
                'link': response.urljoin(conf.xpath('./@href').extract_first()),
            }

            # Follow the conference link and pass the partially-built item along via meta
            yield scrapy.Request(item['link'], callback=self.parse_listing,
                                 meta={'item': item})

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        """
        Parse the conference listing page (the list of articles) here.
        :param response:
        :return:
        """

        # Fetch the article page urls here ==> listing_urls
        # for url in listing_urls:
        #     yield scrapy.Request(url, callback=self.parse_details)

    def parse_details(self, response):
        """
        Parse the article details (title, abstract) here.
        :param response:
        :return:
        """

        # Fetch the article details here ==> details
        # yield details
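
To make this concrete, here is one way the two stubs could be filled in. This is only a minimal sketch: the XPath selectors below are placeholders that have not been verified against the actual markup of aclweb.org, so you would need to adapt them after inspecting the real pages. The key pattern is recovering the partially-built item from response.meta in each callback and completing it on the article page:

    def parse_listing(self, response):
        # Recover the conference item passed along from parse()
        item = response.meta['item']

        # Placeholder selector: one link per article on the conference page
        for url in response.xpath('//a[contains(@href, "/paper")]/@href').extract():
            # Copy the item so each article gets its own conference fields
            yield scrapy.Request(response.urljoin(url), callback=self.parse_details,
                                 meta={'item': dict(item)})

    def parse_details(self, response):
        item = response.meta['item']

        # Placeholder selectors for the article page
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['abstract'] = response.xpath('//div[@class="abstract"]//text()').extract_first()
        yield item

Passing data between callbacks via meta works in every Scrapy version; if you are on Scrapy 1.7 or newer, cb_kwargs is the recommended alternative for the same purpose.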

You can then export the scraped items to a file, for example as CSV:

scrapy crawl toscrape-xpath -o output.csv
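
Scrapy infers the export format from the file extension, so the same command with a different extension produces JSON instead:

scrapy crawl toscrape-xpath -o output.json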