I am trying to extract campaign_hearts and postal_code from the code inside the script tag below (the full code is too long to post):
<script>
...
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...
I can identify the script I need with the following code:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[0]
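However, I don't know how to extract the values I need. (I'm new to Python.) This thread recommended the following solution for a similar problem (edited to reflect the HTML I'm working with):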
data = json.loads(all_scripts[0].get_text()[27:])
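However, running it raises an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Now that I have identified the correct script, how can I extract the values I need? I also tried the solution listed here, but ran into problems importing the parser.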
Answer 0 (score: 3)
You can parse the contents of the <script> tag with the json module and then read your values from the result. For example:
import re
import json
import requests
url = 'https://www.gofundme.com/f/eric-stevens-care-trust'
txt = requests.get(url).text
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
# print( json.dumps(data, indent=4) ) # <-- uncomment this to see all data
print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code =', data['feed']['campaign']['location']['postal_code'])
Prints:
Campaign Hearts = 4817
Postal Code = 90012
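One caveat with the non-greedy pattern `({.*?});` is that it stops at the first `};` it meets, which could cut the JSON short if that sequence ever appears inside a string value. As a hedged alternative, here is a minimal sketch that lets the standard library's `json.JSONDecoder.raw_decode` find the end of the object instead. `SAMPLE_HTML` and `extract_initial_state` are made up for illustration; the sample is a tiny stand-in for the real page, and the `feed` / `campaign` nesting is taken from the code above.

```python
import json

# Hypothetical miniature of the page source; the real page is far larger.
SAMPLE_HTML = (
    '<script>window.initialState = {"feed":{"campaign":{"campaign_hearts":4817,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"}}}};'
    'window.other = 1;</script>'
)


def extract_initial_state(text, marker="window.initialState = "):
    """Return the JSON value that directly follows `marker` in `text`."""
    # str.index raises ValueError if the marker is missing.
    start = text.index(marker) + len(marker)
    # raw_decode parses exactly one JSON value beginning at `start` and
    # ignores whatever comes after it (the semicolon, more JavaScript, ...).
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = extract_initial_state(SAMPLE_HTML)
campaign = data["feed"]["campaign"]
print("Campaign Hearts =", campaign["campaign_hearts"])
print("Postal Code =", campaign["location"]["postal_code"])
```

With the real page you would pass `requests.get(url).text` instead of `SAMPLE_HTML`, assuming the page still assigns the data to `window.initialState`.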
Answer 1 (score: 2)
The more libraries you use, the less efficient your code! Here is a simpler solution:
# Download the page content.
import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
# Fetch the page with a GET request and decode it to text once.
html = requests.get(url).text
# Cut each value out by splitting on the text that surrounds it.
campaign_hearts = html.split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)
postal_code = html.split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
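Note that the split approach depends on the neighbouring keys ("social_share_total" and "is_partner") staying where they are, and the `[1]` index raises IndexError when a key is missing. A small sketch of the same idea that anchors only on the key itself, using `re.search`, might look like this. `SAMPLE` is copied from the fragment in the question, and `find_field` is a hypothetical helper name.

```python
import re

# Fragment of the script contents, as quoted in the question.
SAMPLE = (
    '"campaign_hearts":4817,"social_share_total":11242,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},'
    '"is_partner":false'
)


def find_field(text, pattern):
    """Return the first capture group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# A numeric value: capture the digits after the key.
campaign_hearts = find_field(SAMPLE, r'"campaign_hearts":(\d+)')
# A string value: capture everything up to the closing quote.
postal_code = find_field(SAMPLE, r'"postal_code":"([^"]*)"')

print(campaign_hearts)
print(postal_code)
```

Both values come back as strings; wrap `campaign_hearts` in `int(...)` if you need a number.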
Answer 2 (score: 1)
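Your json.loads call fails because of the trailing semicolon. If you use a regular expression to extract only the object string (without the trailing semicolon), it will work.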
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
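To see the effect of the trailing semicolon in isolation, here is a small sketch on a made-up miniature string (not the real page). It also shows that the "Expecting value ... (char 0)" message from the question is what json.loads reports when the string does not start with a JSON value at all, which can happen when a hard-coded offset such as `[27:]` does not match the actual prefix.

```python
import json

# Hypothetical miniature of the script text.
raw = 'window.initialState = {"campaign_hearts": 4817};'
prefix = "window.initialState = "

# 1) Correct offset, but the trailing semicolon is still attached.
try:
    json.loads(raw[len(prefix):])
    semicolon_error = None
except json.JSONDecodeError as exc:
    semicolon_error = exc.msg

# 2) Wrong offset: the slice starts at "= {" instead of "{".
try:
    json.loads(raw[len(prefix) - 2:])
    offset_error = None
except json.JSONDecodeError as exc:
    offset_error = exc.msg

# 3) Strip both the prefix and the trailing semicolon, then parse.
cleaned = raw[len(prefix):].rstrip().rstrip(";")
data = json.loads(cleaned)

print(semicolon_error)
print(offset_error)
print(data["campaign_hearts"])
```

Computing the offset from `len(prefix)` rather than typing a number avoids the second failure mode.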
Answer 3 (score: 1)
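This should work now; I may try to write a pure lxml version, or at least improve how the element is located.
This solution uses a regular expression to capture only the JSON data, without the window.initialState = prefix and the trailing semicolon.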
import json
import re
import requests
from bs4 import BeautifulSoup
url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"
req = requests.get(url_1)
soup = BeautifulSoup(req.content, 'lxml')
script_tag = soup.find('script')
raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)
json_content = json.loads(raw_json)
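One possible way to make the element search less fragile is to pick the script by its contents instead of assuming it is the first one. The sketch below is only an illustration: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages, and `SAMPLE_HTML` and `ScriptCollector` are made-up names. It also uses `re.search` with `re.DOTALL` rather than `re.fullmatch`, because `fullmatch` returns None when the script text has surrounding whitespace or the JSON contains a newline.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical miniature page with two script tags.
SAMPLE_HTML = (
    "<html><head><script>var x = 1;</script>"
    '<script>\nwindow.initialState = {"feed":{"campaign":'
    '{"campaign_hearts":4817}}};\n</script></head></html>'
)


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


collector = ScriptCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# Choose the script that actually contains the assignment we want.
marker = "window.initialState = "
target = next((s for s in collector.scripts if marker in s), None)

json_content = None
if target is not None:
    # Greedy match up to the last semicolon; DOTALL lets "." span newlines.
    match = re.search(r"window\.initialState = (.+);", target, re.DOTALL)
    if match:
        json_content = json.loads(match.group(1))

print(json_content)
```

With BeautifulSoup the same selection could be written as a loop over `soup.find_all('script')` that checks whether the marker appears in each tag's text.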