Extracting multiple fields with a Scrapy spider

Asked: 2017-09-06 05:41:39

Tags: python web-scraping scrapy scrapy-spider

I want to scrape a website. What I want to extract is the list of documents, the author names, and the dates. I watched some Scrapy spider videos and was able to work out three shell commands that return the data I need from the site. The commands are:

scrapy shell https://www.cato.org/research/34/commentary

Date:

response.css('span.date-display-single::text').extract()

Author:

response.css('p.text-sans::text').extract()

Document links on the page:

response.css('p.text-large.experts-more-h > a::text').extract()

I tried to put this together in Python, but to no avail, since there are multiple values per page.

Here is the Python code:

import scrapy

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass

1 Answer:

Answer 0 (score: 0)

This should work. All you need is to run this command: scrapy runspider cato.py -o out.json. But as far as I can see, there is a mistake in the links selector: with ::text you will only get the link text, not the href; use ::attr(href) to get the URLs.

import scrapy


class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()


class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        dates = response.css('span.date-display-single::text').extract()
        authors = response.css('p.text-sans::text').extract()
        # ::attr(href) returns the URLs; the question's ::text selector
        # would return only the visible link text
        links = response.css('p.text-large.experts-more-h > a::attr(href)').extract()
        for d, a, l in zip(dates, authors, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item