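I am trying to download a table from a website that offers the download through a dropdown menu (an HTML onclick attribute).
How can I trigger the onclick option so the table downloads automatically? This is the code I have written: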
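from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://www.iexindia.com/marketdata/rtm_market_snapshot.aspx'
# Send a browser-like User-Agent with the request
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(request).read()
# Keep the parsed document in its own name instead of shadowing the import
soup = BeautifulSoup(webpage, "lxml")
table = soup.find_all('table')[1]
# First link in the table that carries an onclick handler
properties = table.find_all('a', onclick=True)[0]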
This is the tag I need to execute:
<a alt="Excel" href="javascript:void(0)" onclick="$find('ctl00_InnerContent_reportViewer').exportReport('EXCELOPENXML');" style="color:#3366CC;font-family:Verdana;font-size:8pt;padding:3px 8px 3px 8px;display:block;white-space:nowrap;text-decoration:none;" title="Excel">
Answer 0 (score: 0)
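BeautifulSoup is only an HTML parser; it does not run JavaScript, so it cannot execute an onclick handler.
To interact with the web page, you should use selenium.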
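As a rough sketch of the selenium route: the snippet below opens the page and runs the same JavaScript that the link's onclick attribute contains. The helper names (`make_driver`, `trigger_excel_export`), the readiness probe, and the Chrome download preference are my own assumptions for illustration, not something taken from the site, and I have not run this against the live page.

```python
import time

PAGE_URL = "https://www.iexindia.com/marketdata/rtm_market_snapshot.aspx"
# JavaScript copied from the onclick attribute of the "Excel" link in the question.
EXPORT_JS = "$find('ctl00_InnerContent_reportViewer').exportReport('EXCELOPENXML');"
# Probe used to decide when the report viewer is ready (an assumption about the page).
READY_JS = (
    "return typeof $find === 'function' && "
    "$find('ctl00_InnerContent_reportViewer') !== null;"
)


def make_driver(download_dir):
    """Create a Chrome driver that saves downloads into download_dir (hypothetical helper)."""
    # Imported here so the rest of the module can be loaded without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    return webdriver.Chrome(options=options)


def trigger_excel_export(driver, url=PAGE_URL, timeout=30.0, poll=0.5):
    """Open the page, wait until the probe succeeds, then run the export script."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while not driver.execute_script(READY_JS):
        if time.monotonic() >= deadline:
            raise TimeoutError("report viewer did not become ready")
        time.sleep(poll)
    driver.execute_script(EXPORT_JS)
```

With Chrome installed, something like `driver = make_driver("/absolute/path/to/downloads")` followed by `trigger_excel_export(driver)` should start the download. Keep the browser open until the file has finished downloading, then call `driver.quit()`.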
Answer 1 (score: 0)
This script saves the table to the file data.xls:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
# Export endpoint of the report viewer; {control_id} is filled in below
url = 'https://www.iexindia.com/Reserved.ReportViewerWebControl.axd?Culture=1033&CultureOverrides=True&UICulture=1033&UICultureOverrides=True&ReportStack=1&ControlID={control_id}&Mode=true&OpType=Export&FileName=MarketMinute&ContentDisposition=OnlyHtmlInline&Format=EXCELOPENXML'

# EXCELOPENXML is most likely the Open XML (.xlsx) format, so data.xlsx may be the more accurate name
with requests.session() as s, open('data.xls', 'wb') as f_out:
    # Load the page in the same session so the export request reuses its cookies
    soup = BeautifulSoup(s.get('https://www.iexindia.com/marketdata/rtm_market_snapshot.aspx', headers=headers).content, 'html.parser')
    # The ControlID appears in the src of an <img> tag on the page
    img = soup.select_one('img[src*="ControlID"]')
    control_id = re.search(r'ControlID=([a-f\d]+)', img['src'])[1]
    # Request the export and write the raw bytes to disk
    f_out.write(s.get(url.format(control_id=control_id), headers=headers).content)
[Screenshot of the result in LibreOffice]
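To make the key step in the script easier to follow, here is a small offline sketch of how the ControlID is pulled out of the image src and substituted into the export URL. The sample src value is made up for illustration, and the helper names (`extract_control_id`, `build_export_url`, `looks_like_xlsx`) are hypothetical. The last helper is an optional sanity check that relies on .xlsx files being ZIP archives, which can help spot the case where the server returns an HTML error page instead of a workbook.

```python
import io
import re
import zipfile

# Same export URL as in the script above, split over several lines for readability.
EXPORT_URL = (
    "https://www.iexindia.com/Reserved.ReportViewerWebControl.axd"
    "?Culture=1033&CultureOverrides=True&UICulture=1033&UICultureOverrides=True"
    "&ReportStack=1&ControlID={control_id}&Mode=true&OpType=Export"
    "&FileName=MarketMinute&ContentDisposition=OnlyHtmlInline&Format=EXCELOPENXML"
)


def extract_control_id(src):
    """Return the hex ControlID found in an image src, or raise ValueError."""
    match = re.search(r"ControlID=([a-f\d]+)", src)
    if match is None:
        raise ValueError("no ControlID found in src")
    return match[1]


def build_export_url(control_id):
    """Fill the ControlID into the export URL template."""
    return EXPORT_URL.format(control_id=control_id)


def looks_like_xlsx(data):
    """Return True if the bytes form a ZIP archive, which every .xlsx file is."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Made-up src value, for illustration only.
sample_src = (
    "/Reserved.ReportViewerWebControl.axd"
    "?ReportStack=1&ControlID=0123abcd4567ef89&Mode=true"
)
control_id = extract_control_id(sample_src)
print(control_id)
print(build_export_url(control_id))
```

In the real script, `looks_like_xlsx(response.content)` could be checked before writing the file, so a failed export is noticed right away rather than when the file is opened.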