I'm not new to Scrapy, and I'm trying to get the table data from every page of this website.
But first, I just want to get the table data from page 1.
This is my code:
import scrapy


class UAESpider(scrapy.Spider):
    name = 'uae_free'
    allowed_domains = ['https://www.uaeonlinedirectory.com']
    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }
In the terminal, I get this output:
2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)
Do you know what this output means, and what is wrong with my code?
Update:
I found this answer, and after setting ROBOTSTXT_OBEY = False I no longer get the messages above. But I still can't get any data.
Terminal output after setting ROBOTSTXT_OBEY = False:
2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
Update 2:
I opened a terminal and ran scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:
>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]
So is my XPath wrong?
Answer 0 (score: 0)
I'm not sure why, but for some reason your XPath can't find the table body. I changed it to this, and it seems to work now:
//table[@class="GridViewStyle"]//tr
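A likely explanation, not verified against this site: browsers add a <tbody> element to the DOM when they render a table, so an XPath copied from the browser's developer tools contains a /tbody step, while the raw HTML that Scrapy downloads may not have one. That would match the shell session in the question, where the table is found but /tbody returns an empty list. The sketch below reproduces the effect offline. It uses Python's standard-library xml.etree.ElementTree instead of Scrapy so it runs without a network connection, and the sample markup, the company rows, and the extract_rows helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (invented rows): a table served WITHOUT a
# <tbody> element, which is the situation assumed for the real page.
SAMPLE_HTML = """
<html><body>
<table class="GridViewStyle">
  <tr><th>Company</th><th>Zone</th><th>Phone</th><th>Category</th></tr>
  <tr><td>Acme LLC</td><td>Zone A</td><td>000</td><td>Trading</td></tr>
  <tr><td>Beta FZE</td><td>Zone B</td><td>111</td><td>Logistics</td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Return (rows matched via tbody, items parsed from all <tr> descendants)."""
    root = ET.fromstring(html.strip())
    table = root.find(".//table[@class='GridViewStyle']")

    # Equivalent of .../tbody/tr: matches nothing, the markup has no <tbody>.
    rows_via_tbody = table.findall("tbody/tr")

    # Equivalent of ...//tr: matches every row, whatever the nesting.
    rows = table.findall(".//tr")

    items = []
    for row in rows[1:]:  # skip the header row, as the spider does
        cells = row.findall("td")
        items.append({
            "company_name": cells[0].text,
            "zone": cells[1].text,
            "category": cells[3].text,
        })
    return rows_via_tbody, items


if __name__ == "__main__":
    rows_via_tbody, items = extract_rows(SAMPLE_HTML)
    print(len(rows_via_tbody), "rows matched through tbody")
    for item in items:
        print(item)
```

In the spider itself, the corresponding change would be to replace the selector in parse with response.xpath('//table[@class="GridViewStyle"]//tr') and leave the rest of the loop as it is.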