Question

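I'm not new to Scrapy, and I'm trying to get the table data from every page of this website.

But first, I just want to get the table data from page 1.

Here is my code: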

import scrapy

class UAESpider(scrapy.Spider):
    name = 'uae_free'

    allowed_domains = ['https://www.uaeonlinedirectory.com']

    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }

In the terminal, I get this output:

2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)

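Do you know what this message means, and what is wrong with my code?

Update:

I found this answer, and after setting ROBOTSTXT_OBEY = False I no longer get the messages above. But I still can't get the data.

Terminal output after setting ROBOTSTXT_OBEY = False: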

2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)

Update 2:

I opened a terminal and used scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:

>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]

So is my XPath wrong?

Answer 1

I'm not sure why, but for some reason your XPath can't find the table body. I changed it to this, and it seems to work now:

//table[@class="GridViewStyle"]//tr
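
A likely explanation, offered as an assumption rather than something verified against the site: browsers add a tbody element to the DOM when the HTML source has none, so an XPath copied from the browser's inspector can contain a /tbody step that does not exist in the raw page Scrapy downloads. The sketch below reproduces the effect on a small, made-up table. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors so it runs without fetching anything; the markup, company names, and the cell_text helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the raw page source: the rows sit directly
# under <table>, with no <tbody> element in between.
SAMPLE_HTML = """
<html><body>
<table class="GridViewStyle">
  <tr><th>Company</th><th>Zone</th><th>Phone</th><th>Category</th></tr>
  <tr><td>Acme LLC</td><td>Zone A</td><td>000</td><td>Trading</td></tr>
  <tr><td>Beta FZE</td><td>Zone B</td><td>111</td><td>Services</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_HTML)

# The path from the question: it requires a <tbody> that the source lacks.
with_tbody = root.findall('.//table[@class="GridViewStyle"]/tbody/tr')

# The path from the answer: it matches <tr> at any depth below the table.
without_tbody = root.findall('.//table[@class="GridViewStyle"]//tr')


def cell_text(row, index):
    """Return the stripped text of the index-th <td> (1-based), or None."""
    cells = row.findall('td')
    if index > len(cells):
        return None
    return ''.join(cells[index - 1].itertext()).strip()


# Skip the header row, as the spider in the question does with zones[1:].
items = [
    {
        'company_name': cell_text(row, 1),
        'zone': cell_text(row, 2),
        'category': cell_text(row, 4),
    }
    for row in without_tbody[1:]
]

print(len(with_tbody))     # rows found through /tbody/tr
print(len(without_tbody))  # rows found through //tr
print(items)
```

In the spider itself, the corresponding change would be to select rows with response.xpath('//table[@class="GridViewStyle"]//tr'). Separately, the Crawled (404) line in the log refers only to robots.txt, and allowed_domains is meant to hold domain names such as 'www.uaeonlinedirectory.com' rather than full URLs; neither of these is what leaves the result empty.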
