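I am trying to learn how to scrape dynamic web data from https://www.msn.com/en-us/money/stockdetails/history/fi-a1xzim
The page uses JavaScript to call the URL below, which returns the records: https://finance-services.msn.com/Market.svc/ChartAndQuotes?symbols=126.1.MSFT.NAS&chartType=1d&isETF=false&iseod=False&lang=en-US&isCS=true&isVol=true
I have tried several approaches, but I still cannot retrieve the records with Python; I get "403 - Forbidden: Access is denied."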
import urllib.request

# url = 'https://www.msn.com/en-us/money/stockdetails/history/nas-msft/fi-a1xzim'
url = 'https://finance-services.msn.com/Market.svc/ChartAndQuotes?symbols=126.1.MSFT.NAS&chartType=1d&isETF=false&iseod=False&lang=en-US&isCS=true&isVol=true'
hdr = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
req = urllib.request.Request(url, headers=hdr)
page = urllib.request.urlopen(req)
content = page.read()
print(content)
What should I do to get, from Python, the same data that we can see on the website?
Thank you very much!
Answer 0 (score: 0)
I was able to get a successful response by changing the headers as shown below. Give it a try:
import urllib.request
url = 'https://finance-services.msn.com/Market.svc/ChartAndQuotes?symbols=126.1.MSFT.NAS&chartType=1d&isETF=false&iseod=False&lang=en-US&isCS=true&isVol=true'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
'Referer': 'https://www.msn.com/en-us/money/stockdetails/history/fi-a1xzim',
}
req = urllib.request.Request(url, headers=headers)
page = urllib.request.urlopen(req)
print(page.read())
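If the request still fails on your side, it may help to see the status code instead of a traceback, and to decode the bytes that `read()` returns. The following is only a sketch of the same approach: `build_request` and `fetch_text` are hypothetical helper names, the 10-second timeout and the UTF-8 fallback are my assumptions, and I have not confirmed what format the service returns.

```python
import urllib.error
import urllib.request

# Both URLs come from the question; the headers mirror the ones in this answer.
API_URL = ('https://finance-services.msn.com/Market.svc/ChartAndQuotes'
           '?symbols=126.1.MSFT.NAS&chartType=1d&isETF=false&iseod=False'
           '&lang=en-US&isCS=true&isVol=true')
PAGE_URL = 'https://www.msn.com/en-us/money/stockdetails/history/fi-a1xzim'
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36')


def build_request(url, referer):
    """Hypothetical helper: attach browser-like headers to a GET request."""
    headers = {'User-Agent': USER_AGENT, 'Referer': referer}
    return urllib.request.Request(url, headers=headers)


def fetch_text(url, referer, timeout=10):
    """Hypothetical helper: return the body as text, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url, referer),
                                    timeout=timeout) as page:
            # Use the charset the server declares; fall back to UTF-8 (assumption).
            charset = page.headers.get_content_charset() or 'utf-8'
            return page.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as err:
        # A 403 here means the server still rejects the request.
        print('Request failed:', err.code, err.reason)
        return None
```

Calling `fetch_text(API_URL, PAGE_URL)` should then give you either the response text or `None`, with the status code printed when the server refuses the request.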
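Answer 1 (score: 0)
What you want to do is scrape with Selenium instead of trying to issue a GET request yourself. You need to install chromedriver (or the driver for another browser) first, and then install selenium. Scraping this way opens a browser window, but you can also use an option to run it headless.
This Medium article should help you. Here is a quick example: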
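from selenium import webdriver
url = 'https://finance-services.msn.com/Market.svc/ChartAndQuotes?symbols=126.1.MSFT.NAS&chartType=1d&isETF=false&iseod=False&lang=en-US&isCS=true&isVol=true'
# Selenium 3 style call; Selenium 4 expects webdriver.Chrome(service=Service(path)) instead
driver = webdriver.Chrome('Path in your computer where you have installed chromedriver')
driver.get(url)  # get() returns None; the loaded document is in driver.page_source
content = driver.page_source
# do something with content
driver.quit()  # close the browser when finished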