How to extract content from a <script> tag using Beautiful Soup

Time: 2020-01-17 20:52:44

Tags: python html json python-3.x beautifulsoup

I am trying to extract campaign_hearts and postal_code from the code inside the script tag shown here (the full code is too long to post):

<script>
...    
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...

I can identify the script I need with the following code:

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests 
import re
import json


page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")

soup = BeautifulSoup(page.content, 'html.parser')

all_scripts = soup.find_all('script')
all_scripts[0]

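However, I don't know how to extract the values I need. (I'm new to Python.) This thread recommended the following solution for a similar problem (edited to reflect the HTML I'm working with).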

data = json.loads(all_scripts[0].get_text()[27:])

However, running it produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).

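Now that I have identified the correct script, what do I need to do to extract the values I want? I also tried the solution listed here, but ran into problems importing the parser.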

4 answers:

Answer 0: (score: 3)

You can parse the contents of the <script> tag with the json module and then read your values from the result. For example:

import re
import json
import requests

url = 'https://www.gofundme.com/f/eric-stevens-care-trust'

txt = requests.get(url).text

data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])

# print( json.dumps(data, indent=4) )  # <-- uncomment this to see all data

print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code     =', data['feed']['campaign']['location']['postal_code'])

Prints:

Campaign Hearts = 4817
Postal Code     = 90012
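
The same extraction can be tried offline before hitting the live site. The sketch below is a minimal illustration, not the page's actual content: sample_html is a hypothetical, heavily shortened stand-in for the page source, and extract_initial_state is a helper name introduced here. It assumes the script assigns a single object to window.initialState and ends the statement with a semicolon.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source (the real page is far longer).
sample_html = (
    '<script>window.initialState = {"feed":{"campaign":{"campaign_hearts":4817,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"}}}};'
    '</script>'
)


def extract_initial_state(html):
    """Return the object assigned to window.initialState as a dict, or None if absent."""
    # re.DOTALL lets '.' cross newlines in case the object spans several lines.
    match = re.search(r'window\.initialState = ({.*?});', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


state = extract_initial_state(sample_html)
campaign = state['feed']['campaign']
print('Campaign Hearts =', campaign['campaign_hearts'])
print('Postal Code     =', campaign['location']['postal_code'])
```

Returning None when the pattern is missing avoids the IndexError that indexing an empty findall result with [0] would raise if the page layout changes.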

Answer 1: (score: 2)

The more libraries you use, the less efficient the code! Here is a simpler solution:

# This downloads the website content.

import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
a = requests.get(url)  # a GET request is the usual way to fetch a page
a = a.content          # raw bytes of the response body
print(a)

#These will show your data.

campaign_hearts = str(a,'utf-8').split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)

postal_code = str(a,'utf-8').split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)   
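
Splitting on the neighbouring keys works only while "social_share_total" and "is_partner" keep following the values. A slightly less order-dependent variant, sketched below, matches each key on its own with a regular expression. The fragment is copied from the question with the middle shortened, and find_number and find_string are hypothetical helper names, not part of any library.

```python
import re

# Fragment taken from the question; the rest of the page source is omitted.
fragment = (
    '"campaign_hearts":4817,"social_share_total":11242,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},'
    '"is_partner":false'
)


def find_number(text, key):
    """Return the integer stored under "key":<digits>, or None if the key is absent."""
    m = re.search(r'"%s":(\d+)' % re.escape(key), text)
    return int(m.group(1)) if m else None


def find_string(text, key):
    """Return the text stored under "key":"<value>", or None if the key is absent."""
    m = re.search(r'"%s":"([^"]*)"' % re.escape(key), text)
    return m.group(1) if m else None


print(find_number(fragment, 'campaign_hearts'))
print(find_string(fragment, 'postal_code'))
```

This still treats the JSON as plain text, so it does not handle escaped quotes inside values; parsing the whole object with the json module is more robust when that matters.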

Answer 2: (score: 1)

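Your json.loads call fails because of the trailing semicolon. If you use a regular expression to extract only the object string (without the final semicolon), it will work.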

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests 
import re
import json



page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")

soup = BeautifulSoup(page.content, 'html.parser')

all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
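
To see how the text around the object affects json.loads, the sketch below runs it on a hypothetical one-line script body (script_text is made up for illustration and is much shorter than the real script). It also shows json.JSONDecoder.raw_decode, a standard-library alternative to the regular expression that parses one JSON value starting at a given index and ignores whatever follows it.

```python
import json

# Hypothetical, shortened script body.
script_text = 'window.initialState = {"feed":{"campaign":{"campaign_hearts":4817}}};'


def decode_error(text):
    """Return the JSONDecodeError message for text, or None if it parses cleanly."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)
    return None


start = script_text.index('{')

# The text does not begin with a JSON value, so parsing fails at char 0.
print(decode_error(script_text))
# The text begins with the object but still ends with ';', so parsing fails after it.
print(decode_error(script_text[start:]))

# raw_decode parses the value beginning at `start` and reports where it ended.
data, end = json.JSONDecoder().raw_decode(script_text, start)
print(data['feed']['campaign']['campaign_hearts'], repr(script_text[end]))
```

An "Expecting value ... (char 0)" message generally means the string handed to json.loads does not start with a JSON value at all, while a leftover semicolon produces an "Extra data" message instead.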

Answer 3: (score: 1)

This should work for now; I may try to write a pure lxml version, or at least improve how the element is located.

This solution uses a regular expression to capture only the JSON data, without the window.initialState = prefix and the trailing semicolon.

import json
import re

import requests
from bs4 import BeautifulSoup

url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"

req = requests.get(url_1)

soup = BeautifulSoup(req.content, 'lxml')

script_tag = soup.find('script')

raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)

json_content = json.loads(raw_json)
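
One caveat with re.fullmatch is that the whole string has to match, so any whitespace around the statement makes it return None. Whether the real script tag contains such whitespace is an assumption here; the sketch below uses a hypothetical script_text to show the difference and then reads a value from the parsed result.

```python
import json
import re

# Hypothetical script text with the kind of surrounding whitespace a tag may contain.
script_text = '\n  window.initialState = {"feed":{"campaign":{"campaign_hearts":4817}}};\n'

pattern = r"window\.initialState = (.+);"

# Fails: the leading newline and spaces are not covered by the pattern.
strict = re.fullmatch(pattern, script_text)

# Works: strip() removes the surrounding whitespace, and re.DOTALL lets '.'
# match newlines in case the object itself spans several lines.
lenient = re.fullmatch(pattern, script_text.strip(), re.DOTALL)

json_content = json.loads(lenient.group(1))
print(strict is None)
print(json_content['feed']['campaign']['campaign_hearts'])
```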