Question

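I'm trying to crawl a website with an unusual page structure, page by page, until I reach the items I want to extract data from.

Edit: thanks to the answers, I can now extract most of the data I need, but I still need the path links that lead to the product.

Here is my code so far: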

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):

        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
            'product_number': response.xpath('//div[@id="product-title"]//h1[@style="margin-bottom: 20px; color:#000000; font-size: 23px;"]/text()').extract_first(),
            'product_price': response.xpath('//div[@id="product-title"]//p/text()').extract_first(),
            'product_desc': response.xpath('//div[@class="col-md-6 col-sm-6 col-xs-12 pull-left"]//div[@class="col-md-11 col-sm-11 col-xs-11"]//p/text()').extract_first(),
            'product_path': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/text()').extract(),
            'product_path_links': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/@href').extract(),
        }

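I don't know whether this is possible. Can anyone help me here? I would greatly appreciate it.

More info: I'm trying to reach every category and every item within them, but some categories contain further subcategories before I get to the items.

I'm thinking of using Guillaume's LinkExtractor code, but I'm not sure how to use it to get the result I want...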

rules = (
    Rule(LinkExtractor(allow=['/category-?.*?/'])),
    Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)

Answer 1

All the pages have the same structure; maybe you could shorten it?

tt_content {
  myext_iconlinks {
    dataProcessing {
        1 = TYPO3\CMS\Frontend\DataProcessing\DatabaseQueryProcessor
        1 {
            if.isTrue.field = tx_myext_iconlink
            table = tx_myext_domain_model_iconlink
            join = tx_myext_ttcontent_iconlink_mm AS MM ON MM.uid_foreign = tx_myext_domain_model_iconlink.uid
            as = iconlinks
            orderBy = MM.sorting
        }
     }
  }
}

Answer 2

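Why not use CrawlSpider? It's perfect for this use case!

Basically, it recursively follows all the links on each page and invokes the callback only for the pages of interest (I assume you're interested in the products).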

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):

        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
        }
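
To see which links each rule would pick up, here is a rough sketch that uses only the standard `re` module, not Scrapy itself. It assumes, as I understand it, that the `allow` patterns are applied as regex searches against the absolute URL and that the first matching rule wins. The `classify` helper and the product and about-us URLs are made up for illustration.

```python
import re

# Same patterns as in the rules above.
CATEGORY_RE = re.compile(r'/category-?.*?/')
PRODUCT_RE = re.compile(r'/product/')


def classify(url):
    """Hypothetical helper: return 'category', 'product', or None for a URL."""
    # Rules are checked in the order they are defined; the first match wins.
    if CATEGORY_RE.search(url):
        return 'category'  # followed, no callback
    if PRODUCT_RE.search(url):
        return 'product'   # passed to parse_product
    return None            # ignored by both rules


# Only the first URL comes from the question; the other two are invented.
urls = [
    'https://www.drapertools.com/category/0/Product%20Range',
    'https://www.drapertools.com/product/12345/Example-Item',
    'https://www.drapertools.com/about-us',
]

for url in urls:
    print(url, '->', classify(url))
```

Category pages are only followed, so nested categories get crawled level by level until product links show up, and only those reach `parse_product`.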

Scraping web pages with categories
