I'm trying to scrape the Johns Hopkins Covid-19 site and am using the following code:
import scrapy
from datetime import date

class jhSpider(scrapy.Spider):
    name = 'jh'

    def start_requests(self):
        urls = ['http://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        today = date.today().strftime("%d/%m/%Y")
        today = today[6:11] + today[3:5] + today[0:2]
        filename = 'jhdata_covid_{}.html'.format(today)
        with open(filename, 'wb') as f:
            f.write(response.body)
but no HTML file is even created. However, when I replace the URL with "http://quotes.toscrape.com/page/1/", everything works fine.
Answer 0 (score: 1)
I think using scrapy is overkill in this case, unless you have a good reason not to simply click the raw button on GitHub and right-click > Save as...
To read the table, you can use pandas' read_html() method, like this:
>>> import pandas as pd
>>> pd.read_html('https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/')
OUT:
[ Unnamed: 0 Province/State Country/Region ... 8/11/20 8/12/20 8/13/20
0 NaN NaN Afghanistan ... 1344 1354 1363
1 NaN NaN Albania ... 205 208 213
2 NaN NaN Algeria ... 1322 1333 1341
3 NaN NaN Andorra ... 52 53 53
4 NaN NaN Angola ... 80 80 80
.. ... ... ... ... ... ... ...
261 NaN NaN West Bank and Gaza ... 104 105 106
262 NaN NaN Western Sahara ... 1 1 1
263 NaN NaN Yemen ... 523 528 528
264 NaN NaN Zambia ... 241 246 246
265 NaN NaN Zimbabwe ... 104 122 128
[266 rows x 210 columns]]
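If you do go the "raw" route, here is a minimal sketch (assuming the usual raw.githubusercontent.com layout for that repository path, which is my assumption, not something from your post) that reads the CSV directly and avoids the extra unnamed index column read_html() picks up from GitHub's rendered table:

import pandas as pd

# assumed raw-file URL for the same path; serves the CSV without GitHub's HTML wrapper
raw_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
           'master/csse_covid_19_data/csse_covid_19_time_series/'
           'time_series_covid19_deaths_global.csv')

df = pd.read_csv(raw_url)                                          # parse straight into a DataFrame
df.to_csv('time_series_covid19_deaths_global.csv', index=False)    # keep a local copy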
If you have something else in mind and need to use scrapy for this very specific example, you can either:
set ROBOTSTXT_OBEY in settings.py to False:
ROBOTSTXT_OBEY = False
or call the spider like this:
scrapy crawl jh -s ROBOTSTXT_OBEY=False
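As a third option (a sketch, assuming you'd rather not change the project-wide settings), Scrapy also lets you override the setting for just this spider via the custom_settings class attribute:

class jhSpider(scrapy.Spider):
    name = 'jh'
    # per-spider override: only this spider ignores robots.txt
    custom_settings = {'ROBOTSTXT_OBEY': False}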
Why? Because after running your spider I got an indication in the scrapy log (which you didn't include in your post) that, after being redirected to github's robots.txt, your request was rejected:
2020-08-15 08:54:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://github.com/robots.txt> from <GET http://github.com/robots.txt>
2020-08-15 08:54:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/robots.txt> (referer: None)
2020-08-15 08:54:47 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv/>
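Putting it together: assuming you disable ROBOTSTXT_OBEY as described above, a minimal sketch of your spider that requests the file over https (avoiding the 301 redirect seen in the log) and points at the assumed raw-file URL, so the body written to disk is the actual CSV rather than GitHub's HTML page:

import scrapy
from datetime import date

class jhSpider(scrapy.Spider):
    name = 'jh'

    def start_requests(self):
        # assumed raw-file URL; serves the CSV contents directly over https
        urls = ['https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
                'master/csse_covid_19_data/csse_covid_19_time_series/'
                'time_series_covid19_deaths_global.csv']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        today = date.today().strftime("%Y%m%d")           # same YYYYMMDD stamp, built directly
        filename = 'jhdata_covid_{}.csv'.format(today)    # save as .csv since the body is CSV
        with open(filename, 'wb') as f:
            f.write(response.body)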