Extracting multiple fields with a Scrapy spider

Asked: 2017-09-06 05:41:39

Tags: python web-scraping scrapy scrapy-spider

I want to scrape a website. What I want to extract is the list of documents, the author names, and the dates. I watched some Scrapy spider videos and was able to work out three shell commands that return the data I need from the site. The commands are:

scrapy shell https://www.cato.org/research/34/commentary

Date:

response.css('span.date-display-single::text').extract()

Author:

response.css('p.text-sans::text').extract()

Document links on the page:

response.css('p.text-large.experts-more-h > a::text').extract()

I tried to put this together in Python, but to no avail, since there are multiple values per page.

Here is the Python code:

import scrapy

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass

1 Answer:

Answer 0 (score: 0)

This should work. All you need is to run this command: scrapy runspider cato.py -o out.json. But as far as I can see, there is a mistake in the links selector: with ::text you will only get the link text, not the href; use ::attr(href) to get the URLs.

import scrapy


class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()


class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        dates = response.css('span.date-display-single::text').extract()
        authors = response.css('p.text-sans::text').extract()
        # ::attr(href) returns the URLs; the question's ::text selector
        # would return only the visible link text
        links = response.css('p.text-large.experts-more-h > a::attr(href)').extract()
        for d, a, l in zip(dates, authors, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item