I am trying to download data from the NSE India website. The data comes as zip files that I process after downloading. I have sample code that can download files for dates after 2016:
import os
import urllib2

def start_download():
    directory = 'data'
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    try:
        # req = urllib2.Request("https://www.nseindia.com/content/historical/EQUITIES//2000/JAN/cm01JAN2000bhav.csv.zip", headers=hdr)
        req = urllib2.Request("https://www.nseindia.com/content/historical/EQUITIES//2017/NOV/cm03NOV2017bhav.csv.zip", headers=hdr)
        file_url = urllib2.urlopen(req)
        try:
            if not os.path.exists(directory):
                os.makedirs(directory)
            file_name_obj = open(os.path.join(directory, "hello.zip"), 'wb')
            file_name_obj.write(file_url.read())
            file_name_obj.close()
        except IOError, e:
            print e
    except Exception, e:
        print e
In the code above, when I use the URL "https://www.nseindia.com/content/historical/EQUITIES//2017/NOV/cm03NOV2017bhav.csv.zip", the data downloads. It also downloads when I try it in the Postman client.

When I use this URL instead: https://www.nseindia.com/content/historical/EQUITIES//2000/JAN/cm01JAN2000bhav.csv.zip, I get a 403 Forbidden error both in my code and in Postman. Pasting the link into the Chrome browser fails as well.

However, when I browse to the page "https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm", set the Report to Bhavcopy and the date to 1 January 2000, it downloads the *.csv.zip file successfully.

How can I fix this 403 Forbidden error for the commented-out URL in the sample code?
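For reference, the archive URL pattern can be derived from a date. This is a minimal sketch inferred purely from the two URLs quoted above (not from any official NSE documentation), and it assumes an English locale so that `strftime('%b')` yields abbreviations like `Nov`:

```python
from datetime import date

def bhavcopy_url(d):
    """Build the bhavcopy archive URL for a given date,
    following the pattern seen in the URLs above."""
    mon = d.strftime('%b').upper()       # e.g. 'NOV'
    day = d.strftime('%d%b%Y').upper()   # e.g. '03NOV2017'
    return ('https://www.nseindia.com/content/historical/EQUITIES//'
            '%d/%s/cm%sbhav.csv.zip' % (d.year, mon, day))

print(bhavcopy_url(date(2017, 11, 3)))
# https://www.nseindia.com/content/historical/EQUITIES//2017/NOV/cm03NOV2017bhav.csv.zip
```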
Answer 0 (score: 0)
You need to adjust the headers. Below is an example of how to do that, and of how to write the downloaded file, with Python:
from urllib.request import Request, urlopen
import shutil

link = 'https://www.nseindia.com/content/historical/EQUITIES//2017/NOV/cm03NOV2017bhav.csv.zip'
header = {
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
    'Host': 'www.nseindia.com',
    'Referer': 'https://www.nseindia.com/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

def download_file(link, file_name, length):
    try:
        req = Request(link, headers=header)
        with open(file_name, 'wb') as writer:
            request = urlopen(req, timeout=3)
            shutil.copyfileobj(request, writer, length)
    except Exception as e:
        print('File cannot be downloaded:', e)
    else:
        # 'else' (not 'finally') so the success message is only
        # printed when the download actually succeeded
        print('File downloaded with success!')

file_name = 'new_file.zip'
length = 1024
download_file(link, file_name, length)
Finally, you can check that the file downloaded with this method has the same SHA1 sum as the file downloaded by the browser:

File downloaded with Python:

> sha1sum new_file.zip
daff49646d183636f590db6cbf32c93896179cb2 new_file.zip

File downloaded with Chromium:

> sha1sum cm03NOV2017bhav.csv.zip
daff49646d183636f590db6cbf32c93896179cb2 cm03NOV2017bhav.csv.zip
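The same SHA1 comparison can also be done in Python itself with the standard hashlib module, which avoids depending on the sha1sum command. A small sketch (the file names are the ones used above):

```python
import hashlib

def sha1sum(path, chunk_size=8192):
    """Compute the SHA1 hex digest of a file, reading in chunks
    so that large zip files are not loaded into memory at once."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare the Python download against the browser download:
# sha1sum('new_file.zip') == sha1sum('cm03NOV2017bhav.csv.zip')
```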