DEBUG: Crawled (404) when scraping a table with Scrapy

Time: 2020-07-01 01:52:54

Tags: python scrapy

I'm not new to Scrapy, and I'm trying to get the table data from every page of this website.

But first, I only want to get the table data from page 1.

Here is my code:

import scrapy

class UAESpider(scrapy.Spider):
    name = 'uae_free'

    allowed_domains = ['https://www.uaeonlinedirectory.com']

    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }

In the terminal, I get this output:

2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)

Does anyone know what these messages mean, and what is wrong with my code?

Update

I found this answer, and after setting ROBOTSTXT_OBEY = False I no longer get the messages above. But I still can't get any data.
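For context, ROBOTSTXT_OBEY is a project-level Scrapy setting. The fragment below is a minimal sketch, assuming a standard project layout created by `scrapy startproject`, where it lives in settings.py. Note that the first log already shows the start URL being crawled with status 200, so the robots.txt 404 by itself was not what kept the spider from returning items.

```python
# settings.py of the Scrapy project
# (assumption: the default layout generated by `scrapy startproject`)

# Do not download or obey robots.txt before crawling the start URLs.
ROBOTSTXT_OBEY = False
```

The same key can also be set for a single spider through its `custom_settings` class attribute, for example `custom_settings = {'ROBOTSTXT_OBEY': False}`.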

Terminal output after setting ROBOTSTXT_OBEY = False:

2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)

Update 2:

I opened a terminal and ran scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:

>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]

So is my XPath wrong?

1 Answer:

Answer 0 (score: 0)

I'm not sure why, but for some reason your XPath can't find the table body. I changed it to this, and it seems to work now:

//table[@class="GridViewStyle"]//tr
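
A likely explanation for the missing table body: browsers insert a `<tbody>` element when they build the DOM, so an XPath copied from the developer tools can include `/tbody` even though the raw HTML that Scrapy downloads has none. The scrapy shell output in the question, where the table matches but `/tbody` returns `[]`, is consistent with that. The sketch below reproduces the effect offline. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors so it runs without Scrapy or network access, and the markup, company names, and third column are hypothetical stand-ins for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw HTML the server sends.
# There is no <tbody>: a browser would add one to its DOM, but the downloaded
# source does not contain it.
RAW_HTML = """
<html><body>
<table class="GridViewStyle">
  <tr><th>Company</th><th>Zone</th><th>Phone</th><th>Category</th></tr>
  <tr><td>Acme LLC</td><td>Zone A</td><td>000</td><td>Trading</td></tr>
  <tr><td>Beta FZE</td><td>Zone B</td><td>111</td><td>Logistics</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Shape of the question's selector: requires a <tbody> child, so nothing matches.
with_tbody = root.findall(".//table[@class='GridViewStyle']/tbody/tr")

# Shape of the answer's selector: any <tr> descendant of the table.
rows = root.findall(".//table[@class='GridViewStyle']//tr")

# Skip the header row, as the spider does with zones[1:], and read
# columns 1, 2 and 4 like the original parse() method.
items = [
    {
        "company_name": row.findtext("td[1]"),
        "zone": row.findtext("td[2]"),
        "category": row.findtext("td[4]"),
    }
    for row in rows[1:]
]

print(len(with_tbody), len(rows))
print(items)
```

In the spider itself, the equivalent change is to select rows with `response.xpath('//table[@class="GridViewStyle"]//tr')` and keep the rest of `parse` as it is. Separately, `allowed_domains` expects bare domain names such as `'www.uaeonlinedirectory.com'` rather than full URLs.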