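Scrapy does not scrape items from my URLs: Crawled (200) / Referer: None

Asked: 2018-02-26 13:15:40

Tags: python web-scraping scrapy

I am trying to scrape multiple pages from a website. To do this, I have several start URLs and a method for scraping the next page. The problem is that the spider does not scrape any items and does not seem to crawl the listed pages. I get no results. Do you have any idea how to solve this?

Here is the code: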

    class ListeCourse_level1(scrapy.Spider):
        name = nom_robot
        allowed_domains = domaine

        start_urls = url_lister()
        print(start_urls)
        print('-----------------------------')

        def parse(self, response):    

            selector = Selector(response)    

            for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table/tbody/tr'): 
                loader = ItemLoader(JustrunlahItem(), selector=unElement)

                loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
                loader.add_xpath('eve_date_deb', './/td[1]/div/text()')
                loader.default_input_processor = MapCompose(string)
                loader.default_output_processor = Join()

                yield loader.load_item()

Excerpt from the shell window:

--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/9/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/7/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/2/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Closing spider (finished)
2018-02-26 14:13:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6899,
 'downloader/request_count': 21,
 'downloader/request_method_count/GET': 21,
 'downloader/response_bytes': 380251,
 'downloader/response_count': 21,
 'downloader/response_status_count/200': 12,
 'downloader/response_status_count/301': 9,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 26, 13, 13, 22, 63002),
 'log_count/DEBUG': 22,
 'log_count/INFO': 7,
 'response_received_count': 12,
 'scheduler/dequeued': 20,
 'scheduler/dequeued/memory': 20,
 'scheduler/enqueued': 20,
 'scheduler/enqueued/memory': 20,
 'start_time': datetime.datetime(2018, 2, 26, 13, 13, 17, 308549)}
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Spider closed (finished)

(C:\Users\guichet-v\AppData\Local\Continuum\anaconda3) C:\Users\guichet-v\Documents\CHALLENGE\02_TRAVAIL\ETAPE_1_WebToSGBD\SCRIPT\justrunlah>

2 Answers:

Answer 0 (score: 0)

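Copying an element's XPath from the browser's developer tools gives you an expression that matches only that one element. Even then, browsers sometimes modify the HTML in order to display it, and because your XPath is extremely specific, there is a chance you will not get even that single match.

How do you fix this?

Look at the HTML, find the relevant elements, classes and ids, and write an XPath yourself. For example, something as simple as `//tr` would match all the elements you are trying to match with your long, position-based XPath.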
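
To illustrate the advice above, here is a minimal sketch that compares a position-based path with a hand-written one anchored on a class attribute. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors so that it runs without any installation. The markup, the `events` class and the `wrap` id are hypothetical and not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page source (kept well-formed so the stdlib
# XML parser accepts it). The class name "events" is an assumption.
PAGE = """
<html><body>
  <div id="wrap">
    <div><p>banner</p></div>
    <div>
      <table class="events">
        <tr><td>3 Mar</td><td><a href="/a">Race A</a></td></tr>
        <tr><td>10 Mar</td><td><a href="/b">Race B</a></td></tr>
      </table>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Position-based path in the style of a copied XPath: it depends on the
# exact nesting, so it finds nothing when one wrapper is not where it
# was in the browser (here there is no third inner <div>).
copied_style = root.findall("./body/div/div[3]/table/tr")

# Hand-written path anchored on an attribute: it ignores the wrappers.
hand_written = root.findall(".//table[@class='events']/tr")

print(len(copied_style), len(hand_written))
```

In a spider, the equivalent hand-written expression would be passed to `response.xpath(...)`, after checking in the raw page source which class or id the real table carries.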

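Answer 1 (score: 0)

As @stranac said, the problem comes from the XPath. When I copy my element's XPath from the Google Chrome console, it contains a tbody tag, but that tag is not in the page source. As @gangabass explains here, this is a common issue: "sometimes the source HTML of a table has no tbody tag (modern browsers add it to the DOM automatically)". I removed it and the extraction now works, but the output is not organized the way I want (one event per row): all the extracted data ends up in a single cell.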
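
The tbody effect described above can be reproduced with a small sketch. It again uses the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors, which likewise work on the HTML as downloaded rather than on the browser's DOM. The table content is hypothetical. The field names are the ones from the question. The loop also shows one possible way to keep one record per table row by evaluating relative paths on each row. It is not a diagnosis of why the data ended up in a single cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw source: like the page source described above, the
# table has no <tbody> element.
SOURCE = """
<table class="events">
  <tr><td>3 Mar</td><td><a href="/a">Race A</a></td></tr>
  <tr><td>10 Mar</td><td><a href="/b">Race B</a></td></tr>
</table>
"""

table = ET.fromstring(SOURCE)

# A path copied from the browser includes tbody and matches nothing here.
with_tbody = table.findall("./tbody/tr")

# Dropping tbody matches the rows that exist in the source.
without_tbody = table.findall("./tr")

# Build one record per row, using paths relative to that row.
events = []
for row in without_tbody:
    events.append({
        "eve_date_deb": row.findtext("./td[1]"),
        "eve_nom_evenement": row.findtext("./td[2]/a"),
    })

print(len(with_tbody), events)
```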