Unable to scrape from the Johns Hopkins COVID-19 website

Asked: 2020-08-15 03:26:09

Tags: python scrapy

I'm trying to scrape the Johns Hopkins COVID-19 website using the following code:

import scrapy
from datetime import date

class jhSpider(scrapy.Spider):
    name = 'jh'
    
    def start_requests(self):
        
        urls = ['http://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Reorder dd/mm/YYYY into a YYYYmmdd stamp for the filename
        today = date.today().strftime("%d/%m/%Y")
        today = today[6:11]+today[3:5]+today[0:2]
        filename = 'jhdata_covid_{}.html'.format(today)

        with open(filename, 'wb') as f:
            f.write(response.body)

but the HTML file isn't even created. However, when I replace the URL with "http://quotes.toscrape.com/page/1/", everything works fine.

1 Answer:

Answer 0 (score: 1)

I think using scrapy is overkill in this case, unless you have a good reason not to just click the Raw button on GitHub and right-click > Save as...

To read the table, you can use pandas' read_html() method, like this:

>>> import pandas as pd
>>> pd.read_html('https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/')

OUT:

[     Unnamed: 0 Province/State      Country/Region  ...  8/11/20  8/12/20  8/13/20
0           NaN            NaN         Afghanistan  ...     1344     1354     1363
1           NaN            NaN             Albania  ...      205      208      213
2           NaN            NaN             Algeria  ...     1322     1333     1341
3           NaN            NaN             Andorra  ...       52       53       53
4           NaN            NaN              Angola  ...       80       80       80
..          ...            ...                 ...  ...      ...      ...      ...
261         NaN            NaN  West Bank and Gaza  ...      104      105      106
262         NaN            NaN      Western Sahara  ...        1        1        1
263         NaN            NaN               Yemen  ...      523      528      528
264         NaN            NaN              Zambia  ...      241      246      246
265         NaN            NaN            Zimbabwe  ...      104      122      128

[266 rows x 210 columns]]
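As a side note, the date reshuffling in the question's parse() can be done in one step with a single strftime format string (a minimal sketch, independent of the scraping itself):

```python
from datetime import date

# "%Y%m%d" produces the YYYYmmdd stamp directly, with no slicing needed
today = date.today().strftime("%Y%m%d")
filename = 'jhdata_covid_{}.html'.format(today)
```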

If you have your mind set on using scrapy for this very specific example:

Either:

  • set ROBOTSTXT_OBEY to False in settings.py:

      ROBOTSTXT_OBEY = False 
    

  • or invoke the spider like this:

      scrapy crawl jh -s ROBOTSTXT_OBEY=False
    

Why? Because after running your spider, I got an indication in the scrapy log (which you didn't include in your post) that your request was denied after being redirected to GitHub's robots.txt:

2020-08-15 08:54:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://github.com/robots.txt> from <GET http://github.com/robots.txt>
2020-08-15 08:54:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/robots.txt> (referer: None)
2020-08-15 08:54:47 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/>