Scraping a dynamic JavaScript website in Google Colab

Date: 2020-08-08 07:57:44

Tags: python pandas selenium beautifulsoup google-colaboratory

Using Python 3 in Google Colab, I can read data from a JavaScript-driven website when that data is defined as a <table>:

import requests
import pandas as pd
url = 'https://datatables.net/extensions/buttons/examples/html5/simple.html'
df = pd.read_html(requests.get(url).text)[0]
print(df)

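However, I would like to read data from a more complex JavaScript site directly into Python 3 in Google Colab in a similar way. This data may not be defined as a <table>.

For example, I want to see which dates are "sold out" and which are not on the following site: https://shop.perisher.com.au/lift-ticket-calendar

The difference between available (blue) dates and sold-out (red) dates is shown here

I tried to do this with a combination of Selenium, BeautifulSoup and Pandas in Python 3 on Colab, but without success.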

1 answer:

Answer 0: (score: 0)

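The site takes some time to load. Once it has loaded, though, you can inspect the network calls it makes. The site makes an AJAX call to load all of the data about sold-out dates, so you can send the same request yourself and parse the response.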
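As a rough illustration of what such a call carries, the sketch below URL-encodes a set of form fields the way an application/x-www-form-urlencoded body is built. The field names and values are the ones used in the request in this answer; the variable names form_fields, body and decoded are only illustrative, and the standard-library urlencode is used here just to show the encoding, not as anything the site documents.

```python
from urllib.parse import urlencode, parse_qs

# Form fields taken from the request in this answer.
form_fields = {
    "inventoryPoolCode": 8452,
    "duration": 1,
    "quantity": 1,
    "productDate": "9 August, 2020",
}

# Encode the fields as a form body: spaces become '+', commas become '%2C'.
body = urlencode(form_fields)
print(body)

# Decoding the body gives the original values back (each as a list of strings).
decoded = parse_qs(body)
print(decoded["productDate"])
```

Comparing a string like this with the request body shown in the browser's network tab is a quick way to check that the payload matches what the page itself sends.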

import requests
from bs4 import BeautifulSoup

payload = {
    "inventoryPoolCode": 8452,
    "duration": 1,
    "quantity": 1,
    "productDate": "9 August, 2020"
}

headers = {
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8"
}

# Replay the AJAX call that the calendar page makes when it loads.
res = requests.post("https://shop.perisher.com.au/ProductCalendar/Index", data=payload, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

# Sort each calendar cell into sold-out or available by its CSS classes.
data = {'soldout': [], 'notsoldout': []}
for span in soup.find_all("span", class_="grid-item"):
    if "empty" in span["class"]:
        continue  # skip cells marked "empty"
    date = span["data-date"].strip()
    if "soldout" in span["class"]:
        data['soldout'].append(date)
    else:
        data['notsoldout'].append(date)

print("Sold Out Dates")
print(data["soldout"])
print("---" * 25)
print("Available Dates")
print(data["notsoldout"])

Output:

Sold Out Dates
['2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-20', '2020-08-22', '2020-08-23', '2020-08-29', '2020-08-30']
---------------------------------------------------------------------------
Available Dates
['2020-08-12', '2020-08-19', '2020-08-21', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28']
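Since the question mentions Pandas, one possible next step is to merge the two lists into a single date-ordered set of records. The following is a minimal sketch under that assumption: the helper name to_records and the record keys "date" and "sold_out" are hypothetical, and the sample dates are a small subset of the output above.

```python
from datetime import date

def to_records(sold_out, available):
    """Merge two lists of ISO date strings into date-sorted status records."""
    records = [{"date": date.fromisoformat(d), "sold_out": True} for d in sold_out]
    records += [{"date": date.fromisoformat(d), "sold_out": False} for d in available]
    return sorted(records, key=lambda r: r["date"])

# A few dates copied from the output above.
sample_sold_out = ["2020-08-08", "2020-08-09", "2020-08-13"]
sample_available = ["2020-08-12", "2020-08-19"]

records = to_records(sample_sold_out, sample_available)
for r in records:
    print(r["date"].isoformat(), "sold out" if r["sold_out"] else "available")
```

With the scraper's real result, calling to_records(data["soldout"], data["notsoldout"]) and passing the returned list to pd.DataFrame should give one row per date with a boolean sold_out column.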