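I am scraping a web page with Python and Beautiful Soup.
I have run into a problem: the result contains the raw JavaScript template interpolation instead of the value itself.
So instead of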
<span>2.4%</span>
I get:
<span> {{ item.rate }} </span>
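in the output from Beautiful Soup.
a) Am I doing something wrong? (Similar code works on other sites, so I don't think so, but I could be mistaken.)
or
b) Is there a way to work around this?
My code: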
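import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# find_all returns a list of every matching <ul> element
divs = soup.find_all("ul", {"class": "result-table--grid"})
print(divs[0])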
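Thanks!
Answer 0: (score: 0)
The values are filled in by JavaScript after the page loads, so they are not in the HTML that requests receives. You can instead access the response in JSON format as shown below, then use json_normalize to flatten it. You will see that some columns still contain nested lists/dictionaries. So I am also providing a second solution that flattens those as well, although it really expands your table horizontally.
Code 1: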
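To check whether a page has this problem before parsing it, one could scan the raw HTML for unrendered placeholders. This is only a minimal sketch using the standard library; the `find_placeholders` helper and the two HTML strings are hypothetical stand-ins for `response.text`, not part of the original code.

```python
import re

# Matches unrendered template placeholders such as "{{ item.rate }}".
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_placeholders(html):
    """Return the expressions found inside {{ ... }} in the raw HTML."""
    return PLACEHOLDER.findall(html)


# Hypothetical snippets standing in for response.text.
raw_html = "<ul><li><span> {{ item.rate }} </span></li></ul>"
rendered_html = "<ul><li><span>2.4%</span></li></ul>"

print(find_placeholders(raw_html))       # ['item.rate']
print(find_placeholders(rendered_html))  # []
```

If the list is non-empty, the data is probably loaded separately, and the browser's network tab is a reasonable place to look for the request that returns it.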
import requests
import pandas as pd
# Results page as opened in the browser (kept for reference; not requested below)
url = "https://www.moneysupermarket.com/mortgages/results/#?goal=1&property=170000&borrow=150000&types=1&types=2&types=3&types=4&types=5"
# Endpoint that returns the results as JSON
request_url = 'https://www.moneysupermarket.com/bin/services/aggregation'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
payload = {
    'channelId': '55',
    'enquiryId': '2e619c17-061a-4812-adad-40a9f9d8dcbc',
    'limit': '20',
    'offset': '0',
    'sort': 'initialMonthlyPayment'}
jsonObj = requests.get(request_url, headers=headers, params=payload).json()
# Flatten the nested 'quote' dict of each result into a one-row DataFrame
frames = [pd.json_normalize(each['quote']) for each in jsonObj['results']]
# Stack the rows; DataFrame.append was removed in pandas 2.0, so use concat
results = pd.concat(frames, ignore_index=True)
Output 1:
print (results)
@class ... trackerDescription
0 com.moneysupermarket.mortgages.entity.Mortgage... ...
1 com.moneysupermarket.mortgages.entity.Mortgage... ...
2 com.moneysupermarket.mortgages.entity.Mortgage... ...
3 com.moneysupermarket.mortgages.entity.Mortgage... ...
4 com.moneysupermarket.mortgages.entity.Mortgage... ...
5 com.moneysupermarket.mortgages.entity.Mortgage... ...
6 com.moneysupermarket.mortgages.entity.Mortgage... ...
7 com.moneysupermarket.mortgages.entity.Mortgage... ...
8 com.moneysupermarket.mortgages.entity.Mortgage... ...
9 com.moneysupermarket.mortgages.entity.Mortgage... ...
10 com.moneysupermarket.mortgages.entity.Mortgage... ...
11 com.moneysupermarket.mortgages.entity.Mortgage... ...
12 com.moneysupermarket.mortgages.entity.Mortgage... ...
13 com.moneysupermarket.mortgages.entity.Mortgage... ...
14 com.moneysupermarket.mortgages.entity.Mortgage... ...
15 com.moneysupermarket.mortgages.entity.Mortgage... ... after 26 Months,BBR + 3.99% for the remaining ...
16 com.moneysupermarket.mortgages.entity.Mortgage... ...
17 com.moneysupermarket.mortgages.entity.Mortgage... ...
18 com.moneysupermarket.mortgages.entity.Mortgage... ...
19 com.moneysupermarket.mortgages.entity.Mortgage... ... after 26 Months,BBR + 3.99% for the remaining ...
[20 rows x 51 columns]
Code 2:
import requests
import pandas as pd
# Results page as opened in the browser (kept for reference; not requested below)
url = "https://www.moneysupermarket.com/mortgages/results/#?goal=1&property=170000&borrow=150000&types=1&types=2&types=3&types=4&types=5"
# Endpoint that returns the results as JSON
request_url = 'https://www.moneysupermarket.com/bin/services/aggregation'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
payload = {
    'channelId': '55',
    'enquiryId': '2e619c17-061a-4812-adad-40a9f9d8dcbc',
    'limit': '20',
    'offset': '0',
    'sort': 'initialMonthlyPayment'}
data = requests.get(request_url, headers=headers, params=payload).json()

def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Leaf value: store it under the key without the trailing '_'
            out[name[:-1]] = x
    flatten(y)
    return out

# One flattened row per result; keys missing from a row become NaN
rows = [flatten_json(each) for each in data['results']]
results = pd.DataFrame(rows)
Output 2:
print (results)
apply_active apply_desktop ... straplineLinkLabel topTip
0 True True ... None None
1 True True ... None None
2 True True ... None None
3 True True ... None None
4 True True ... None None
5 True True ... None None
6 True True ... None None
7 True True ... None None
8 True True ... None None
9 True True ... None None
10 True True ... None None
11 True True ... None None
12 True True ... None None
13 True True ... None None
14 True True ... None None
15 True True ... None None
16 True True ... None None
17 True True ... None None
18 True True ... None None
19 True True ... None None
[20 rows x 131 columns]
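To make the horizontal expansion concrete, here is a small sketch of the same flattening idea applied to a made-up record. The `sample` dict and its field names are hypothetical and are not taken from the real API response; the function is a lightly restyled copy of `flatten_json` so that the example runs on its own.

```python
def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            # List items get their position as part of the key.
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            # Leaf value: drop the trailing '_' from the accumulated key.
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical record, shaped loosely like one entry of a results list.
sample = {
    "apply": {"active": True},
    "quote": {
        "rate": 2.4,
        "fees": [
            {"name": "arrangement", "amount": 999},
            {"name": "valuation", "amount": 0},
        ],
    },
}

flat = flatten_json(sample)
for key, value in flat.items():
    print(key, "=", value)
```

Each list element becomes its own group of columns (`quote_fees_0_name`, `quote_fees_1_name`, and so on), which is why the second table is so much wider than the first.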