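I am trying to navigate to a link and extract data from it (the data is an href download link). That data should be added as a new field alongside the fields already collected from the first page (the page the links come from), but I am struggling to do this.
First, I wrote a parse method that extracts all the links from the first page and adds them to a field named "links". Each of these links leads to a page containing a download button, and what I need is the real link behind that button. So I loop over the extracted links with a for loop and call yield response.follow on each one, but it does not work.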
import scrapy

class thirdallo(scrapy.Spider):
    name = "thirdallo"
    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        yield {
            'path': response.css('ol.breadcrumb li a::text').extract(),
            'links': response.css('#top .default .er').xpath('@href').extract()
        }
        hrefs=response.css('#top .default .er').xpath('@href').extract()
        for i in hrefs:
            yield response.follow(i, callback=self.parse,meta={'finalLink' :response.css('a.btn.btn-primary').xpath('@href)').extract() })
Answer 0 (score: 0)
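Among the @href values you are trying to scrape, there appear to be some .rar links, and these cannot be parsed by the callback you specified.
Here is my code using the requests and lxml libraries: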
>>> import requests
>>> from lxml import html
>>> s = requests.Session()
>>> resp = s.get('https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii')
>>> doc = html.fromstring(resp.text)
>>> doc.xpath("//*[@id='top']//*//*[@class='default']//*//*[@class='er']/@href")
['https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-1-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-2-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-3-aldora-2.rar', 'https://www.alloschool.com/assets/documents/course-342/jdadat-alftra-4-aldora-2.rar', 'https://www.alloschool.com/element/44905', 'https://www.alloschool.com/element/43081', 'https://www.alloschool.com/element/43082', 'https://www.alloschool.com/element/43083', 'https://www.alloschool.com/element/43084', 'https://www.alloschool.com/element/43085', 'https://www.alloschool.com/element/43086', 'https://www.alloschool.com/element/43087', 'https://www.alloschool.com/element/43088', 'https://www.alloschool.com/element/43080', 'https://www.alloschool.com/element/43089', 'https://www.alloschool.com/element/43090', 'https://www.alloschool.com/element/43091', 'https://www.alloschool.com/element/43092', 'https://www.alloschool.com/element/43093', 'https://www.alloschool.com/element/43094', 'https://www.alloschool.com/element/43095', 'https://www.alloschool.com/element/43096', 'https://www.alloschool.com/element/43097', 'https://www.alloschool.com/element/43098', 'https://www.alloschool.com/element/43099', 'https://www.alloschool.com/element/43100', 'https://www.alloschool.com/element/43101', 'https://www.alloschool.com/element/43102', 'https://www.alloschool.com/element/43103', 'https://www.alloschool.com/element/43104', 'https://www.alloschool.com/element/43105', 'https://www.alloschool.com/element/43106', 'https://www.alloschool.com/element/43107', 'https://www.alloschool.com/element/43108', 'https://www.alloschool.com/element/43109', 'https://www.alloschool.com/element/43110', 'https://www.alloschool.com/element/43111', 'https://www.alloschool.com/element/43112', 'https://www.alloschool.com/element/43113']
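The list above mixes direct .rar downloads with /element/ pages. As a minimal sketch of how the two kinds could be told apart, the helper below checks the extension of the URL path instead of searching the whole string for '.rar'. The name is_direct_download and the ARCHIVE_EXTENSIONS set are hypothetical, introduced only for illustration; only .rar actually appears in the output above.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical set of extensions treated as direct downloads.
# Only ".rar" is seen in the scraped output; the others are assumptions.
ARCHIVE_EXTENSIONS = {".rar", ".zip", ".7z"}


def is_direct_download(url):
    """Return True if the URL path ends in a known archive extension."""
    path = urlparse(url).path  # drops any query string or fragment
    return splitext(path)[1].lower() in ARCHIVE_EXTENSIONS


links = [
    "https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-1-aldora-1.rar",
    "https://www.alloschool.com/element/44905",
    "https://www.alloschool.com/element/43081",
]

# Pages that still need to be visited to find the download button.
pages = [u for u in links if not is_direct_download(u)]
# Links that are already the final download target.
downloads = [u for u in links if is_direct_download(u)]
```

Checking the parsed path rather than the raw string avoids false matches when '.rar' happens to appear elsewhere in a URL, for example in a query parameter.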
In your code, try the following:
for i in hrefs:
    # Skip direct .rar downloads; only follow links that lead to HTML pages
    if '.rar' not in i:
        yield response.follow(i, callback=self.parse, meta={'finalLink': response.css('a.btn.btn-primary').xpath('@href').extract()})