Parsing an HTML document with Beautiful Soup

Posted: 2019-02-09 01:33:50

Tags: python beautifulsoup

I am trying to parse an HTML page with Beautiful Soup. Specifically, I am looking at a very large array called "g_rgTopCurators" embedded in the page.

I am trying to figure out how to correctly use soup.select() to get each "name" for every curator in this large array.

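For reference, soup.select() takes a CSS selector and returns the matching elements; it can locate the <script> element that defines the array, but the JSON inside that element still has to be parsed separately. A minimal sketch of that idea, using a made-up one-line snippet in place of the real page:

import json
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page source.
html = '<script>var g_rgTopCurators = [{"name": "Example Curator"}];</script>'
soup = BeautifulSoup(html, 'html.parser')
script_text = soup.select('script')[0].string  # select() finds the element, not the data inside it
raw = script_text.split('var g_rgTopCurators = ', 1)[1].rstrip(';')
for curator in json.loads(raw):
    print(curator['name'])  # Example Curator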

1 Answer:

Answer 0 (score: 1):

Since the response is JSON containing HTML, and that HTML contains a script element holding more JSON, my first approach was:

import requests
import json
from bs4 import BeautifulSoup

url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
loaded_response = response.json() # Get the JSON response containing the HTML containing the required JSON.
results_html = loaded_response['results_html'] # Get the HTML from the JSON
soup = BeautifulSoup(results_html, 'html.parser')
text = soup.find_all('script')[1].text # Get the second script element, which defines g_rgTopCurators.
# Get the JSON in the HTML script element
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn:  # Iterate through the JSON array
    print(i['name'])

Output:

Cynical Brit Gaming
PC Gamer
Just Good PC Games
...

WGN Chat
Bloody Disgusting Official
Orlygift
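
As a possible follow-up, the start and count query parameters in the URL suggest the endpoint is paginated. Below is a sketch that wraps the approach above in a function and walks a few pages; whether larger start values keep returning the same script structure is an assumption here:

import requests
import json
from bs4 import BeautifulSoup

def curator_names(start, count=50):
    # One page of curators; start/count mirror the query parameters in the URL above.
    url = f"https://store.steampowered.com/curators/ajaxgetcurators/render?start={start}&count={count}"
    response = requests.get(url, headers={"Accept": "application/json"})
    results_html = response.json()['results_html']
    soup = BeautifulSoup(results_html, 'html.parser')
    text = soup.find_all('script')[1].text
    marker = "var g_rgTopCurators = "
    jn = json.loads(text[text.index(marker) + len(marker):text.index("var fnCreateCapsule")].strip().rstrip(';'))
    return [curator['name'] for curator in jn]

names = []
for page_start in range(0, 150, 50):  # first three pages of 50
    names.extend(curator_names(page_start))
print(len(names), names[:5])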

There is a quicker approach: get the response body as bytes, decode and unescape it, and then go straight to the required JSON with string manipulation:

import requests
import json

url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
text = response.content.decode("unicode_escape") # Decode the response bytes, unescaping the embedded content
# find the JSON
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn:  # Iterate through the JSON array
    print(i['name'])
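
On the design choice: both versions rely on the two marker strings being present, and text.index() raises ValueError if they are not. A roughly equivalent sketch using a regular expression instead, which simply skips the loop when the array is missing:

import re
import json
import requests

url = "https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
text = requests.get(url, headers={"Accept": "application/json"}).content.decode("unicode_escape")
# Capture the array assigned to g_rgTopCurators, up to the terminating semicolon.
# Non-greedy match; assumes "];" does not occur inside the JSON data itself.
match = re.search(r"var g_rgTopCurators = (\[.*?\]);", text, re.DOTALL)
if match:
    for curator in json.loads(match.group(1)):
        print(curator['name'])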