HTTP 403 error in web scraping

Asked: 2016-10-17 10:47:07

Tags: python url web-scraping urllib http-status-code-403

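I am trying to download the bhavcopy files of NSE EoD data for the past 15 years by web scraping with urllib.request.

urllib.request behaves strangely: the request works for one date, but for another date it fails with a 403 Access Denied error.

I send browser-like HTTP headers to disguise the request, but it still fails in one case.

Here is the code: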

import urllib.error
import urllib.request
def downloadCMCSV(year="2001",mon="JAN",dd="01"):
    #baseurl = "https://www.nseindia.com"
    headers = {'Host':'www.nseindia.com:443',
               'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding':'gzip, deflate, sdch, br',
               'Accept-Language':'en-US,en;q=0.8',
               #'Cookie':'NSE-TEST-1=1809850378.20480.0000; pointer=1; sym1=ONGC; pointerfo=1; underlying1=ONGC; instrument1=FUTSTK; optiontype1=-; expiry1=27OCT2016; strikeprice1=-',
               'Cookie':'NSE-TEST-1=1809850378.20480.0000; pointer=1; sym1=ONGC; pointerfo=1; underlying1=ONGC; instrument1=FUTSTK; optiontype1=-; expiry1=27OCT2016; strikeprice1=-; JSESSIONID=B4CA0543FF4C33FD9EA9D18B95238DB4',
               'Referer':'Referer: https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm',
               'Upgrade-Insecure-Requests':'1',
               'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    filename = "cm%s%s%sbhav.csv" % (dd,mon,year)
    urlcm = "https://www.nseindia.com/content/historical/EQUITIES/%s/%s/%s.zip" % (year, mon, filename)
    print(urlcm)
    request = urllib.request.Request(urlcm, headers = headers)
    #print(dir(request))
    #print(request.headers)
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print("Bhavcopy not available for", year, mon, dd)
            return
        print(e.code)
        print(e.read())  
        return
    if response.code == 200:
        print("The response is good", response.length)

if __name__ == "__main__":
    #getAll()
    downloadCMCSV('2001','JAN', '01')
    downloadCMCSV('2016','JAN', '01')

The output is as follows:

https://www.nseindia.com/content/historical/EQUITIES/2001/JAN/cm01JAN2001bhav.csv.zip
403
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;www&#46;nseindia&#46;com&#47;content&#47;historical&#47;EQUITIES&#47;2001&#47;JAN&#47;cm01JAN2001bhav&#46;csv&#46;zip" on this server.<P>\nReference&#32;&#35;18&#46;33210f17&#46;1476700779&#46;13b4f615\n</BODY>\n</HTML>\n'
https://www.nseindia.com/content/historical/EQUITIES/2016/JAN/cm01JAN2016bhav.csv.zip
The response is good 58943

Can you help me figure out what I am doing wrong?

1 Answer:

Answer 0: (score: 0)

Pass a User-Agent, an 'Accept': '*/*' header, and a Referer header:

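from urllib import request

# Only three headers are sent: User-Agent, a wildcard Accept, and the archive page as Referer.
url = "https://www.nseindia.com/content/historical/EQUITIES/2001/JAN/cm01JAN2001bhav.csv.zip"
r = request.Request(url, headers={'User-Agent': 'mybot', 'Accept': '*/*',
                                  "Referer": "https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm"})

# urlopen raises urllib.error.HTTPError on a 403; otherwise it returns the response object.
print(request.urlopen(r))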

You don't need cookies or any of the other header settings.
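As a rough sketch, the same three headers could be folded back into the download function from the question. The helper names `bhavcopy_url`, `build_request`, and `download_bhavcopy`, the `timeout` value, and the explicit month table are assumptions made for illustration; the URL pattern and header values are taken from the question and the answer above. Whether the server still accepts these requests is not guaranteed.

```python
import datetime
import urllib.error
import urllib.request

BASE = "https://www.nseindia.com/content/historical/EQUITIES"
REFERER = "https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm"
# Explicit table so the month abbreviation does not depend on the system locale.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def bhavcopy_url(day):
    """Build the archive URL for one trading day (pattern copied from the question)."""
    mon = MONTHS[day.month - 1]
    filename = "cm%02d%s%dbhav.csv" % (day.day, mon, day.year)
    return "%s/%d/%s/%s.zip" % (BASE, day.year, mon, filename)


def build_request(url):
    """Attach only the three headers suggested in the answer; no cookies."""
    headers = {"User-Agent": "mybot", "Accept": "*/*", "Referer": REFERER}
    return urllib.request.Request(url, headers=headers)


def download_bhavcopy(day, timeout=30):
    """Return the zip file as bytes, or None if the server answers with an HTTP error."""
    req = build_request(bhavcopy_url(day))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # 404 usually means no file for that day; 403 means the request was refused.
        print("HTTP", e.code, "for", day.isoformat())
        return None
```

Calling `download_bhavcopy(datetime.date(2001, 1, 1))` would then return the raw zip bytes on success, which can be written to disk or opened with `zipfile`. Looping over a date range with `datetime.timedelta(days=1)` covers the 15-year span; days that return `None` can simply be skipped.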