Question

I'm new to Scrapy and got stuck while trying to extract data from multiple sites with CrawlSpider.

Here is my code:

class ivwSpider(CrawlSpider):

    name = "ivw-online"
    allowed_domains = ["ausweisung.ivw-online.de/"]
    start_urls = ["http://ausweisung.ivw-online.de/index.php?i=1161&a=o44847"]

    pagelink = LinkExtractor(allow=('index.php?i=1161&a=o\d{5}'))
    print(pagelink)
    rules = (Rule(pagelink, callback='parse_item', follow=True), )

    def parse_item(self, response):

        sel = Selector(response)

        item = IVWItem()
        item["Type"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[0].extract()
        item["Zeitraum"] = sel.xpath('//div[@class ="tabelle"]//tr[1]//div[@style="width:210px; text-align:center;"]/text()')[0].extract()
        item["Company"] = sel.xpath('//div[@class ="stammdaten"]//tr//td/text()').extract()[-1]
        item["Video_PIs"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z5"]/text()').extract()
        item["Video_Visits"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z4"]/text()').extract()
        item["PIs"] = sel.xpath('//div[@class ="statistik"]//tr[3]//td/text()')[1].extract()
        item["Visits"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[1].extract()

        return item

When I run the code, nothing is returned. Is the problem in my rule definition? Any help here is greatly appreciated!

Answer 1

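Your start_url is already a detail page, and I couldn't find a list of the other competitors on it, so I went one level up in the site hierarchy and used http://ausweisung.ivw-online.de/index.php?i=116 as the starting point. That page has a table listing many competitors.

From this start_url you can collect the URLs of all the companies and create the requests directly with your callback, like this: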

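import scrapy
from scrapy.selector import Selector

# IVWItem is the item class from the asker's own project (not shown here);
# import it from that project's items module.


class ivwSpider(scrapy.Spider):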

    name = "ivw-online"
    allowed_domains = ["ausweisung.ivw-online.de"]
    start_urls = ["http://ausweisung.ivw-online.de/index.php?i=116"]

    def parse(self, response):

        sel_rows = response.xpath('//div[@class="daten"]/div[@class="tabelle"]//tr')

        for sel_row in sel_rows:
            url_detail = sel_row.xpath('./td[@class="a_main_txt"][1]/a/@href').extract_first()
            if url_detail:
                url = response.urljoin(url_detail)
                # print url
                yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):

        sel = Selector(response)

        item = IVWItem()
        item["Type"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[0].extract()
        item["Zeitraum"] = sel.xpath('//div[@class ="tabelle"]//tr[1]//div[@style="width:210px; text-align:center;"]/text()')[0].extract()
        item["Company"] = sel.xpath('//div[@class ="stammdaten"]//tr//td/text()').extract()[-1]
        item["Video_PIs"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z5"]/text()').extract()
        item["Video_Visits"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z4"]/text()').extract()
        item["PIs"] = sel.xpath('//div[@class ="statistik"]//tr[3]//td/text()')[1].extract()
        item["Visits"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[1].extract()

        yield item
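
The call to response.urljoin in parse turns the href taken from each table row into an absolute URL before the request is created. As far as I know it behaves much like urllib.parse.urljoin applied to the page's URL, so the step can be sketched with the standard library alone. The href value below is an assumption for illustration; the real markup of the table may differ.

```python
from urllib.parse import urljoin

# URL of the overview page used as start_urls in the answer.
page_url = "http://ausweisung.ivw-online.de/index.php?i=116"

# Hypothetical relative href, as it might appear in a table row's <a> tag.
href = "index.php?i=1161&a=o44847"

# Resolve the relative link against the page it was found on.
absolute = urljoin(page_url, href)
print(absolute)
```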

Note that the base class is no longer CrawlSpider but Spider.
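
As for the original question about the rule definition: LinkExtractor treats its allow patterns as regular expressions, and in the pattern from the question the "?" and "." are not escaped. The following is a small sketch using only Python's re module, not Scrapy itself, to show how that pattern behaves against the start URL. The escaped variant is my suggestion and does not come from the original code.

```python
import re

url = "http://ausweisung.ivw-online.de/index.php?i=1161&a=o44847"

# Pattern from the question: "?" makes the preceding "p" optional and
# "." matches any character, so the literal "?" in the URL is never matched.
unescaped = r"index.php?i=1161&a=o\d{5}"

# Suggested variant (an assumption, not from the original post):
# "\." and "\?" match the literal characters.
escaped = r"index\.php\?i=1161&a=o\d{5}"

unescaped_hit = re.search(unescaped, url) is not None
escaped_hit = re.search(escaped, url) is not None

print(unescaped_hit)
print(escaped_hit)
```

The code above also writes allowed_domains without the trailing slash that the question had. That entry is meant to hold a bare domain name, so the slash may have been a second reason why nothing was crawled.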
