I am trying to scrape multiple pages from a website. To do this, I have several start URLs and a method to crawl the next pages. The problem is that the spider does not scrape any items and does not seem to crawl the indicated pages. I get no results. Do you have any idea how to solve this?
Here is the code:
# Imports needed by this snippet (Scrapy 1.x API, as of 2018):
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
# JustrunlahItem comes from the project's items module;
# nom_robot, domaine, url_lister and string are defined elsewhere in the project.

class ListeCourse_level1(scrapy.Spider):
    name = nom_robot
    allowed_domains = domaine
    start_urls = url_lister()
    print(start_urls)
    print('-----------------------------')

    def parse(self, response):
        selector = Selector(response)
        # One table row per event; note the tbody in this browser-generated XPath
        for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table/tbody/tr'):
            loader = ItemLoader(JustrunlahItem(), selector=unElement)
            loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
            loader.add_xpath('eve_date_deb', './/td[1]/div/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
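url_lister() is not shown in the question. A hypothetical minimal version, just to illustrate how the paginated start URLs seen in the crawl log below could be built (the page count is an assumption):

def url_lister():
    # Hypothetical helper (the real one is not shown in the question):
    # build the paginated start URLs that appear in the crawl log.
    # The page range 1..9 is an assumption for illustration.
    base = 'https://www.justrunlah.com/running-events-calendar-malaysia/page/{}/'
    return [base.format(n) for n in range(1, 10)]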
Excerpt from the shell window:
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/9/> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/7/> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/page/2/> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Closing spider (finished)
2018-02-26 14:13:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6899,
'downloader/request_count': 21,
'downloader/request_method_count/GET': 21,
'downloader/response_bytes': 380251,
'downloader/response_count': 21,
'downloader/response_status_count/200': 12,
'downloader/response_status_count/301': 9,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 2, 26, 13, 13, 22, 63002),
'log_count/DEBUG': 22,
'log_count/INFO': 7,
'response_received_count': 12,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2018, 2, 26, 13, 13, 17, 308549)}
2018-02-26 14:13:22 [scrapy.core.engine] INFO: Spider closed (finished)
(C:\Users\guichet-v\AppData\Local\Continuum\anaconda3) C:\Users\guichet-v\Documents\CHALLENGE\02_TRAVAIL\ETAPE_1_WebToSGBD\SCRIPT\justrunlah>
Answer 0 (score: 0)
Copying an element's XPath from your browser's developer tools gives you something that matches only that one element. Even then, browsers sometimes need to modify the HTML in order to display it, and since your XPath is super specific, chances are you won't even get that one match.

How to fix this?

Look at the HTML yourself, find the relevant elements, classes and ids, and write your own XPath. For example, something as simple as //tr would match all of the elements you are trying to match.
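As a hedged illustration of that advice, the loop in the question could use a hand-written row XPath in place of the browser-generated one. Note that //table//tr is an assumption about the page layout and may need narrowing to the actual events table:

# Sketch only: a simple, hand-written row selector instead of the
# browser-copied XPath. '//table//tr' assumes the events table is the
# relevant <table> on the page; narrow it down if other tables exist.
for unElement in response.xpath('//table//tr'):
    loader = ItemLoader(JustrunlahItem(), selector=unElement)
    loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
    loader.add_xpath('eve_date_deb', './/td[1]/div/text()')
    yield loader.load_item()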
Answer 1 (score: 0)
As @stranac said, the problem comes from the XPath. When I copy my element's XPath from the Google Chrome console, there is a tbody tag, but that tag is not present in the page source. As @gangabass explains here, this is a common issue: sometimes there is no tbody tag in the source HTML of a table (modern browsers add it automatically to the DOM). I removed it and the extraction works, but the data is not organized the way I wanted (one event per row): I get all the extracted data in a single cell.
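A minimal sketch of that fix, assuming everything else from the question stays the same: the row XPath simply drops tbody, and each ItemLoader is built from the row selector unElement so the relative .//td paths only see one row. Building the loader from the whole response instead would join every row's text into one value, which matches the "all data in one cell" symptom:

def parse(self, response):
    # Same XPath as the question, minus the tbody that browsers inject
    # into the DOM but that is absent from the page source.
    for unElement in response.xpath(
            '//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]'
            '/div/div[2]/div[3]/table/tr'):
        # The loader is scoped to one <tr>, so each yield is one event.
        loader = ItemLoader(JustrunlahItem(), selector=unElement)
        loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
        loader.add_xpath('eve_date_deb', './/td[1]/div/text()')
        loader.default_output_processor = Join()
        yield loader.load_item()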