I've been trying to write a script to download all of the forex pair historical data (in zip format) from here in one go. My problem is that on the final page, the one containing the link to the file, there is no direct reference to the file, and the href only shows: href="javascript:return true;"
<a id="a_file" title="Download the zip data file" href="javascript:return true;" target="nullDisplay">HISTDATA_COM_MT_EURUSD_M1_201905.zipHISTDATA_COM_MT_EURUSD_M1_201905.zip</a>
Here is a link to one of the download pages.
I'm new to Python and web scraping, and would appreciate any help that points me in the right direction.
Answer 0 (score: 0)
Use Chrome dev tools to inspect the exact request that is sent, and look for the additional form data and headers being posted. For your case, I found the headers and form data needed to download the zip file. The code below should work; just install the requests library before running it.
import requests

resp = requests.post('http://www.histdata.com/get.php',
                     data={
                         'tk': '43a87a0c7e650addea7b01a17395a91c',
                         'date': '2018',
                         'datemonth': '2018',
                         'platform': 'MT',
                         'timeframe': 'M1',
                         'fxpair': 'EURUSD'
                     },
                     headers={
                         'User-Agent': 'Mozilla/5.1',
                         'Origin': 'http://www.histdata.com',
                         'Referer': 'http://www.histdata.com/download-free-forex-historical-data/?/metatrader/1-minute-bar-quotes/eurusd/2018'
                     },
                     stream=True)  # stream the response so the whole file is not held in memory

with open('output.zip', 'wb') as fpw:
    for chunk in resp.iter_content(chunk_size=8192):
        fpw.write(chunk)
Note: because the response is streamed to disk in chunks rather than read into memory, this also works for large files.
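As a small sanity check (my own addition, not part of the original answer), you may want to confirm the server actually returned a zip file rather than an HTML error page before writing anything to disk. A minimal sketch, assuming the same resp object as above:

# Hypothetical check: histdata.com may return an HTML page instead of a zip
# if the tk token is stale, so verify status and content type first.
if resp.status_code != 200 or 'zip' not in resp.headers.get('Content-Type', ''):
    raise RuntimeError('Did not receive a zip file; the tk token may have expired')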
Answer 1 (score: 0)
Scrape the download page http://www.histdata.com/download-free-forex-historical-data/?/metatrader/1-minute-bar-quotes/eurusd/2018 and grab the value of the hidden input with name="tk" id="tk":
<div style="display:none;">
<form id="file_down" name="file_down" target="nullDisplay" method="POST" action="/get.php">
<input type="hidden" name="tk" id="tk" value="43a87a0c7e650addea7b01a17395a91c" />
<input type="hidden" name="date" id="date" value="2018" />
<input type="hidden" name="datemonth" id="datemonth" value="2018" />
<input type="hidden" name="platform" id="platform" value="MT" />
<input type="hidden" name="timeframe" id="timeframe" value="M1" />
<input type="hidden" name="fxpair" id="fxpair" value="EURUSD" />
</form>
You can also grab all the other IDs the same way:
def downloadzipfile(zipfiletype, zipfiletimeframe, zipfilefxpair, zipfileyear, zipfilemonth):
    postuseragent = 'Mozilla/5.1'
    postorigin = 'http://www.histdata.com'
    posturl = postorigin+'/download-free-forex-historical-data/?/'+zipfiletype+'/'+zipfiletimeframe+'/'+zipfilefxpair+'/'+zipfileyear+'/'+zipfilemonth
    targetfolder = 'C:/temp/'
    # Get the page and make the soup
    r = requests.get(posturl)
    data = r.text
    soup = BeautifulSoup(data, "lxml")
    # The hidden form sits in <div style="display:none;">
    table = soup.find("div", style="display:none;")
    #print(table)
    try:
        posttk = table.find('input', {'id': 'tk'}).get('value')
        print(posttk)
    except:
        pass
    try:
        postdate = table.find('input', {'id': 'date'}).get('value')
        print(postdate)
    except:
        pass
    try:
        postdatemonth = table.find('input', {'id': 'datemonth'}).get('value')
        print(postdatemonth)
    except:
        pass
    try:
        postplatform = table.find('input', {'id': 'platform'}).get('value')
        print(postplatform)
    except:
        pass
    try:
        posttimeframe = table.find('input', {'id': 'timeframe'}).get('value')
        print(posttimeframe)
    except:
        pass
    try:
        postfxpair = table.find('input', {'id': 'fxpair'}).get('value')
        print(postfxpair)
    except:
        pass
Then, still inside the function, you request the ZIP file with a POST:
    targetfilename = 'HISTDATA_COM_'+postplatform+'_'+postfxpair+'_'+posttimeframe+postdatemonth+'.zip'
    targetpathfilename = targetfolder+targetfilename
    print(targetfilename)
    print(targetpathfilename)
    resp = requests.post(postorigin+'/get.php',
                         data={'tk': posttk, 'date': postdate, 'datemonth': postdatemonth, 'platform': postplatform, 'timeframe': posttimeframe, 'fxpair': postfxpair},
                         headers={'User-Agent': postuseragent, 'Origin': postorigin, 'Referer': posturl})
Then write it to disk and wait for the write to finish:
    # Wait here for the file to download
    result = None
    while result is None:
        with open(targetpathfilename, 'wb') as fpw:
            for chunk in resp.iter_content():
                fpw.write(chunk)
        time.sleep(1)
        result = 1
Put all of this inside a loop over the FX pairs and timeframes you want, and you can scrape the site automatically:
print('Extract all ZIPfiles from history fx ')
symbolsub = ["GBPJPY", "GBPUSD", "EURGBP"]
for symbolsubstring in symbolsub:
    for yearsub in range(2003, 2020):
        for monthsub in range(1, 13):
            filetype = 'ascii'
            filetimeframe = 'tick-data-quotes'
            currencypair = symbolsubstring
            fileyear = str(yearsub)
            filemonth = str(monthsub)
            print(filetype, filetimeframe, currencypair, fileyear, filemonth)
            downloadzipfile(filetype, filetimeframe, currencypair, fileyear, filemonth)
Put the pieces above together, add the imports, and you have a scraper for the site.
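For reference, the imports the snippets above rely on would be along these lines (the lxml parser used by BeautifulSoup has to be installed separately):

# Imports needed by the snippets above
import time

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml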