In Google Colab with Python 3, I can read JavaScript data from a website when it is defined as a <table>:
import requests
import pandas as pd
url = 'https://datatables.net/extensions/buttons/examples/html5/simple.html'
df = pd.read_html(requests.get(url).text)[0]
print(df)
However, I would similarly like to read data from a more complex JavaScript site directly into Python 3 in Google Colab. This data may not be defined in <table> format.
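For example, I want to see which dates are "sold out" and which are not on the following site: https://shop.perisher.com.au/lift-ticket-calendar
The difference between available (blue) dates and sold-out (red) dates is shown here
I have tried combining Selenium, BeautifulSoup and Pandas in Python 3 on Colab to do this, but without success.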
Answer 0 (score: 0)
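The website takes some time to load. Once it has loaded, however, you can inspect the network calls it makes. The site makes an AJAX call to load all the data about sold-out dates, and that call can be reproduced directly with requests: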
import requests, json
from bs4 import BeautifulSoup
payload = {
    "inventoryPoolCode": 8452,
    "duration": 1,
    "quantity": 1,
    "productDate": "9 August, 2020"
}
headers = {
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8"
}
res = requests.post("https://shop.perisher.com.au/ProductCalendar/Index", data=payload, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
data = {'soldout':[], 'notsoldout':[]}
for span in soup.find_all("span", class_="grid-item"):
    # Skip cells marked with the "empty" class
    if "empty" in span["class"]:
        continue
    date = span["data-date"].strip()
    if "soldout" in span["class"]:
        data["soldout"].append(date)
    else:
        data["notsoldout"].append(date)
print("Sold Out Dates")
print(data["soldout"])
print("---" * 25)
print("Available Dates")
print(data["notsoldout"])
Output:
Sold Out Dates
['2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-20', '2020-08-22', '2020-08-23', '2020-08-29', '2020-08-30']
---------------------------------------------------------------------------
Available Dates
['2020-08-12', '2020-08-19', '2020-08-21', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28']
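Since the question asks for something table-like, the two lists can be merged into a single chronological calendar. The following is only a minimal sketch: the helper name `to_calendar` and the status labels are assumptions made for illustration, and the sample dates are a small subset of the output above.

```python
from datetime import date


def to_calendar(data):
    """Merge the two date lists into one sorted list of (date, status) rows."""
    rows = []
    for d in data["soldout"]:
        rows.append((date.fromisoformat(d), "sold out"))
    for d in data["notsoldout"]:
        rows.append((date.fromisoformat(d), "available"))
    # Tuples sort by their first element, so rows end up in date order
    return sorted(rows)


# Sample input in the same shape as the `data` dict built above
sample = {
    "soldout": ["2020-08-11", "2020-08-13"],
    "notsoldout": ["2020-08-12"],
}

calendar = to_calendar(sample)
for day, status in calendar:
    print(day.isoformat(), status)
```

If a DataFrame is wanted, the rows can be passed to pandas with `pd.DataFrame(calendar, columns=["date", "status"])`.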