Question

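I've just started using Scrapy and Python, and I've been following this tutorial, but I'm stuck. I've been able to use the shell to get a list of links from the page, like this: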

>>> response.css('li').xpath('a/@href').getall()

which gives me:

['/shop-online/542/fragrances', '/shop-online/81/vitamins', '/shop-online/257/beauty', '/shop-online/665/skin-care', '/shop-online/648/cosmetics', '/shop-online/517/weight-loss', '/shop-online/20/baby-care', '/shop-online/89/sexual-health', '/shop-online/198/smoking-deterrents', '/shop-online/3240/clearance', '/prescriptions', '/shop-online/258/medicines', '/shop-online/1093/cold-flu', '/shop-online/PS-1755/all-fish-oil-supplements', '/shop-online/159/oral-hygiene-and-dental-care', '/shop-online/792/household', '/shop-online/129/hair-care', '/shop-online/1255/sports-nutrition', '/bestsellers', '/categories', 'https://www.chemistwarehouse.hk', '/', '#', '/login', '/youraccount', '#', '/aboutus', '/aboutus/shipping', '/shop-online/542/fragrances', '/shop-online/81/vitamins', '/shop-online/257/beauty', '/shop-online/665/skin-care', '/shop-online/648/cosmetics', '/shop-online/517/weight-loss', '/shop-online/20/baby-care', '/shop-online/89/sexual-health', '/shop-online/198/smoking-deterrents', '/prescriptions', '/shop-online/258/medicines', '/shop-online/1093/cold-flu', '/shop-online/PS-1755/all-fish-oil-supplements', '/shop-online/159/oral-hygiene-and-dental-care', '/shop-online/792/household', '/shop-online/129/hair-care', '/shop-online/1255/sports-nutrition', '/bestsellers']

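What I'd like to do, at least for now in the shell (and later in a script), is parse out all of the links that don't contain shop-online and then scrape the corresponding URLs, each of which would be www..website/ followed by the link I scraped.

But I'm not sure how to do that. I know you can use regular expressions, but I'm not sure how to apply them, and even if I could, I'm not sure how to tell Scrapy to iterate over what I found and scrape THOSE pages.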

Answer 1

I'd like to […] parse out all of the links that don't contain shop-online and then scrape the corresponding URLs

In a spider callback, that would be:

for link in response.xpath('//li//a/@href[contains(., "/shop-online/")]'):
    yield response.follow(link.get())

In the shell you can only work with one request at a time, since it is meant for debugging only, so you just pick one link and fetch it:

link = response.xpath('//li//a/@href[contains(., "/shop-online/")]').get()  # Gets the first link only
fetch(response.follow(link))
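
On the regular-expression part of the question: the list returned by `getall()` is an ordinary Python list of strings, so a plain substring test is probably enough. The sketch below uses only the standard library. The base URL is a made-up stand-in for the page that was fetched, `split_links` is a hypothetical helper name, and the hrefs are a handful copied from the question's output. It builds both the "contains" and the "does not contain" lists, so you can use whichever one you meant.

```python
from urllib.parse import urljoin

# Hypothetical base URL standing in for the page that was fetched.
BASE_URL = "https://example.com/"

# A few of the hrefs from the question's output, including a duplicate.
hrefs = [
    "/shop-online/542/fragrances",
    "/shop-online/81/vitamins",
    "/prescriptions",
    "#",
    "/login",
    "/shop-online/542/fragrances",
]


def split_links(hrefs, needle="/shop-online/"):
    """Return (matching, non_matching) hrefs, de-duplicated, in order."""
    unique = list(dict.fromkeys(hrefs))
    matching = [h for h in unique if needle in h]
    non_matching = [h for h in unique if needle not in h]
    return matching, non_matching


shop_links, other_links = split_links(hrefs)

# Resolve each relative href against the base URL to get an absolute URL.
shop_urls = [urljoin(BASE_URL, h) for h in shop_links]
print(shop_urls)
```

Inside a spider you would not normally need `urljoin` yourself, because `response.follow` does the equivalent resolution against `response.url`.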

Using Scrapy to iterate over found a-href URL links to scrape the corresponding pages
