Scrapy CrawlSpider skips links and does not return the response body

Time: 2019-05-08 12:39:33

Tags: python-3.x scrapy web-crawler

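I am currently trying to scrape this page: http://search.siemens.com/en/?q=iot

To do that I need to extract the result links and parse each of them, and I have just learned that CrawlSpider is the class to use for this. My implementation does not seem to work, though. As a test, I tried to return the response body of every page. Unfortunately, the spider opens only about a third of the links and does not give me any response body.

Any idea what I am doing wrong?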

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiemensCrawlSSpider(CrawlSpider):
    name = 'siemens_crawl_s'
    allowed_domains = ['search.siemens.com/en/?q=iot']
    start_urls = ['http://search.siemens.com/en/?q=iot']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//dl[@id="search-resultlist"]/dt/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield response.body

1 answer:

Answer 0 (score: 1)

Set LOG_LEVEL = 'DEBUG' in settings.py and you will see that some requests are being filtered out because of the allowed_domains parameter:

2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siemens.com': <GET https://www.siemens.com/global/en/home/products/software/mindsphere-iot.html>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.industry.siemens.com.cn': <GET https://www.industry.siemens.com.cn/automation/cn/zh/pc-based-automation/industrial-iot/iok2k/Pages/iot.aspx>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'w3.siemens.com': <GET https://w3.siemens.com/mcms/pc-based-automation/en/industrial-iot>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'new.siemens.com': <GET https://new.siemens.com/global/en/products/services/iot-siemens.html>

allowed_domains expects bare domain names, so a value that includes a path and query string, such as 'search.siemens.com/en/?q=iot', does not match the host of any of these links.

You can try using allowed_domains = ['siemens.com', 'siemens.com.cn'], or not set allowed_domains at all.
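The effect of the different allowed_domains values can be illustrated with a small standard-library sketch. This is a simplified approximation of a domain-suffix check, not Scrapy's actual offsite middleware code; the helper name is_allowed is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Approximate an offsite check: the URL's host must equal an allowed
    domain or be a subdomain of one. An empty list allows everything."""
    if not allowed_domains:
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


link = "https://www.siemens.com/global/en/home/products/software/mindsphere-iot.html"

# The value from the question contains a path and a query string,
# so it can never equal (or be a suffix of) a bare hostname.
print(is_allowed(link, ["search.siemens.com/en/?q=iot"]))

# Bare domain names match the host and all of its subdomains.
print(is_allowed(link, ["siemens.com", "siemens.com.cn"]))

# No allowed_domains at all: nothing is filtered.
print(is_allowed(link, []))
```

Under these assumptions the first call reports the link as offsite, while the other two let it through, which matches the filtered requests shown in the log.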

https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains