I want to scrape a website. What I want to extract is the list of documents, the author names, and the dates. I watched some Scrapy spider videos and was able to work out three shell commands that return the data I need from the site. The commands are:
scrapy shell https://www.cato.org/research/34/commentary
Date:
response.css('span.date-display-single::text').extract()
Author:
response.css('p.text-sans::text').extract()
Document links on the page:
response.css('p.text-large.experts-more-h > a::text').extract()
I tried to do the same thing from Python, but without success, since there are multiple pieces of data per page. Here is the Python code:
import scrapy

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass
Answer 0 (score: 0)
This should work. All you need to do is run this command:
scrapy runspider cato.py -o out.json
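For reference, the `-o out.json` flag makes Scrapy serialize every yielded item into a JSON array, one object per item. A minimal sketch of reading that file back (the values here are hypothetical stand-ins; the field names assume the item fields used in this answer):

```python
import json

# Hypothetical sample mirroring the shape of Scrapy's JSON feed export:
# a list of objects, one per yielded item (values are made up).
sample = '''
[
  {"date": "January 1, 2020", "author": "Jane Doe", "links": "/commentary/example-one"},
  {"date": "January 2, 2020", "author": "John Roe", "links": "/commentary/example-two"}
]
'''

items = json.loads(sample)
for item in items:
    print(item["date"], "-", item["author"])
```

In a real run you would open `out.json` with `json.load` instead of parsing an inline string.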
One caveat: as far as I can see, the link selector has a problem — with `a::text` you will only get the anchor text from the links, not the href. To capture the URL itself, use `::attr(href)` instead.
import scrapy

class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        date = response.css('span.date-display-single::text').extract()
        author = response.css('p.text-sans::text').extract()
        # ::attr(href) yields the link URL; a::text would yield only the anchor text
        links = response.css('p.text-large.experts-more-h > a::attr(href)').extract()
        for d, a, l in zip(date, author, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item
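One thing to watch with this pattern: `zip` stops at the shortest input, so if the page yields, say, fewer author lines than dates, some rows are silently dropped and the remaining fields can drift out of alignment. A small self-contained illustration (the lists here are made-up stand-ins for the selector output):

```python
# Made-up stand-ins for the three extracted lists.
dates = ["Jan 1", "Jan 2", "Jan 3"]
authors = ["A. Smith", "B. Jones"]   # one author line missing from the page
links = ["/a", "/b", "/c"]

# zip truncates to the shortest list, so only 2 rows survive here.
rows = list(zip(dates, authors, links))
print(rows)
```

A more robust approach is to select one container element per article and run the date, author, and link selectors relative to that container, so a missing field affects only its own row instead of shifting every later one.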