Question

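I am trying to scrape several pages from a website. To do that, I have a list of different start URLs and a method for scraping the next page. The problem is that the spider does not scrape any items and does not seem to parse the pages it is given. I get no results. Do you have any idea how to fix this?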

Here is the code:

    class ListeCourse_level1(scrapy.Spider):
        name = nom_robot
        allowed_domains = domaine

        start_urls = url_lister()
        print(start_urls)
        print('-----------------------------')

        def parse(self, response):    

            selector = Selector(response)    

            for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table/tbody/tr'): 
                loader = ItemLoader(JustrunlahItem(), selector=unElement)

                loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
                loader.add_xpath('eve_date_deb', './/td[1]/div/text()')
                loader.default_input_processor = MapCompose(string)
                loader.default_output_processor = Join()

                yield loader.load_item()

Excerpt from the shell window:

--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/9/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/7/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/2/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Closing spider (finished)
2018-02-26 14:13:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6899,
 'downloader/request_count': 21,
 'downloader/request_method_count/GET': 21,
 'downloader/response_bytes': 380251,
 'downloader/response_count': 21,
 'downloader/response_status_count/200': 12,
 'downloader/response_status_count/301': 9,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 26, 13, 13, 22, 63002),
 'log_count/DEBUG': 22,
 'log_count/INFO': 7,
 'response_received_count': 12,
 'scheduler/dequeued': 20,
 'scheduler/dequeued/memory': 20,
 'scheduler/enqueued': 20,
 'scheduler/enqueued/memory': 20,
 'start_time': datetime.datetime(2018, 2, 26, 13, 13, 17, 308549)}
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Spider closed (finished)

(C:\Users\guichet-v\AppData\Local\Continuum\anaconda3) C:\Users\guichet-v\Documents\CHALLENGE\02_TRAVAIL\ETAPE_1_WebToSGBD\SCRIPT\justrunlah>

Answer 1

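Copying an element's XPath from the browser's developer tools gives you an expression that matches only that one element. Even then, browsers sometimes have to modify the HTML in order to display it, and because your XPath is extremely specific, you may not even get that single match.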

How can you fix this?

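Look at the HTML, find the relevant elements, classes and ids, and write an XPath yourself. For example, something as simple as //tr would match all the rows you are trying to match with your long expression.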
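As a rough sketch of that advice, the snippet below selects the rows with a short hand-written expression and then reads each field relative to its row, so every row produces one record. The markup, the `events` class name and the `extract_events` helper are hypothetical stand-ins, not the real page, and the standard library's `xml.etree.ElementTree` is used in place of Scrapy so the example runs without installing anything. In a spider, the same idea would be applied with `response.xpath(...)` for the rows and relative `.//` expressions inside the loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page source.
PAGE = """
<html><body>
  <div id="td-outer-wrap">
    <table class="events">
      <tr><td><div>3 Mar 2018</div></td><td><a href="/a">Night Run</a></td></tr>
      <tr><td><div>10 Mar 2018</div></td><td><a href="/b">City Marathon</a></td></tr>
    </table>
  </div>
</body></html>
"""


def extract_events(page):
    """Return one dict per table row (hypothetical helper)."""
    root = ET.fromstring(page.strip())
    events = []
    # Short expression anchored on a class instead of a long absolute path.
    for row in root.findall(".//table[@class='events']//tr"):
        # Lookups are relative to the current row, so fields stay grouped per event.
        events.append({
            "eve_nom_evenement": row.findtext("td[2]/a", default="").strip(),
            "eve_date_deb": row.findtext("td[1]/div", default="").strip(),
        })
    return events


events = extract_events(PAGE)
for event in events:
    print(event)
```

Anchoring on a class or id keeps the expression valid when wrapper divs are added or removed, which is exactly where a copied absolute path tends to break.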

Answer 2

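As @stranac said, the problem comes from the XPath. When I copy the XPath of my element from the Chrome console, it contains a tbody tag, but that tag is not in the page source. As @gangabass explains elsewhere, this is a common problem: "sometimes there is no tbody tag in the source HTML for tables (modern browsers add it to the DOM automatically)". I removed it and the extraction now works, but the result is not organised the way I want (one row per event): all the extracted data ends up in a single cell.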
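The tbody effect can be reproduced in a few lines. This is only an illustration under stated assumptions: `RAW_SOURCE` is a made-up fragment shaped like source HTML in which the rows sit directly under the table, and it is parsed with the standard library's `xml.etree.ElementTree` rather than with Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: as in the raw page source, there is no <tbody>.
RAW_SOURCE = (
    "<div><table>"
    "<tr><td>Night Run</td></tr>"
    "<tr><td>City Marathon</td></tr>"
    "</table></div>"
)

root = ET.fromstring(RAW_SOURCE)

# Path as copied from the browser, which shows a <tbody> it inserted itself.
with_tbody = root.findall(".//table/tbody/tr")

# Same path with the <tbody> step removed.
without_tbody = root.findall(".//table/tr")

print(len(with_tbody), len(without_tbody))  # prints: 0 2
```

The first expression finds nothing because the parser only sees the markup that was actually downloaded, not the DOM the browser builds from it.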

Scrapy doesn't scrape items from my URLs: Crawled (200) / Referer: None
