Trouble getting a Scrapy spider to crawl deeply nested pages

Date: 2017-04-08 22:08:54

Tags: python scrapy

I want my spider to scrape the "Follower" and "Following" counts for every professional. At the moment it only yields 6 results out of thousands. How can I get the complete results?

The "items.py" file contains:

import scrapy
class HouzzItem(scrapy.Item):
    Following = scrapy.Field()
    Follower = scrapy.Field()

The spider, named "houzzsp.py", contains:

# scrapy.contrib was deprecated and later removed; use the current module paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HouzzspSpider(CrawlSpider):
    name = "houzzsp"
    allowed_domains = ['www.houzz.com']
    start_urls = ['http://www.houzz.com/professionals']

    rules = [
            Rule(LinkExtractor(restrict_xpaths='//li[@class="sidebar-item"]')),
            Rule(LinkExtractor(restrict_xpaths='//a[@class="navigation-button next"]')),
            Rule(LinkExtractor(restrict_xpaths='//div[@class="name-info"]'),
            callback='parse_items')
    ]    


    def parse_items(self, response):
        page = response.xpath('//div[@class="follow-section profile-l-sidebar "]')
        for titles in page:
            Score = titles.xpath('.//a[@class="following follow-box"]/span[@class="follow-count"]/text()').extract()
            Score1 = titles.xpath('.//a[@class="followers follow-box"]/span[@class="follow-count"]/text()').extract()
            yield {'Following':Score,'Follower':Score1}
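The extraction step in parse_items can be sanity-checked offline against a small HTML snippet. Below is a minimal sketch using only the standard library; the markup is hypothetical, modeled on the class names the spider's XPaths target (xml.etree.ElementTree supports just the limited XPath subset used here):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet mimicking the profile sidebar
# markup that parse_items targets (class names taken from the spider).
html = """
<div class="follow-section profile-l-sidebar ">
  <a class="following follow-box"><span class="follow-count">12</span></a>
  <a class="followers follow-box"><span class="follow-count">340</span></a>
</div>
"""

root = ET.fromstring(html)
# Same attribute-predicate paths as the spider, relative to the section div.
following = root.find('.//a[@class="following follow-box"]/span[@class="follow-count"]').text
followers = root.find('.//a[@class="followers follow-box"]/span[@class="follow-count"]').text
item = {'Following': following, 'Follower': followers}
print(item)
```

If the XPaths are right, this prints one item with both counts; a None result here would point to a class-name mismatch rather than a crawling problem.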

EDIT: The rules have been changed and the spider now works as expected.

1 Answer:

Answer 0 (score: 1)

When using Scrapy's LinkExtractor with the restrict_xpaths argument, you don't need to specify the exact xpath of the links you want to follow. From Scrapy's documentation:

    restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from.

So the idea is to point restrict_xpaths at sections of the page, and LinkExtractor will dive into those tags on its own to find the links to follow.
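This region-then-links idea can be illustrated outside Scrapy in a few lines of standard-library code. The markup below is made up for the demo; only the "name-info" class name comes from the spider above:

```python
import xml.etree.ElementTree as ET

# Toy page: links both inside and outside the region we care about.
page = """
<html><body>
  <div class="name-info"><a href="/pro/alice">Alice</a></div>
  <div class="ads"><a href="/sponsored">Ad</a></div>
</body></html>
"""

root = ET.fromstring(page)
# Mimic restrict_xpaths: first select the region, then collect every
# <a> tag found inside it, instead of pointing at the <a> directly.
links = [a.get("href")
         for region in root.iter("div")
         if region.get("class") == "name-info"
         for a in region.iter("a")]
print(links)  # only the link inside the name-info region survives
```

LinkExtractor does the second step (finding the a tags inside the region) for you, which is why the xpath only needs to name the region.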

To sum up, don't point restrict_xpaths at the a tags themselves (pointing at @href would be even worse), because LinkExtractor will find the a tags inside whatever xpath you specify on its own.