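I've written a Python script that pulls data out of a webpage; the data is JSON and currently ends up in the links variable.
I can't work out how to go one step further and extract all of the available links.
Here is my attempt: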
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.afterpay.com/en-AU/categories'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
item = soup.select_one("[data-react-class='SharedStateHydrator']")
categories = json.loads(item.get("data-react-props"))['categoriesResponse']['data']
for linklist in categories:
    links = linklist['relationships']
    print(links)
Part of the output, for a single block:
{'stores': {'links': {'related': 'https://store-directory-api.afterpay.com/api/v1/categories/jewellery/stores?locale=en-AU'}}, 'topStores': {'links': {'related': 'https://store-directory-api.afterpay.com/api/v1/categories/jewellery/stores?locale=en-AU'}}, 'featuredStores': {'links': {'related': 'https://store-directory-api.afterpay.com/api/v1/categories/jewellery/stores?featured=true&locale=en-AU'}}, 'children': {'data': [{'type': 'categories', 'id': '135'}, {'type': 'categories', 'id': '326'}, {'type': 'categories', 'id': '38'}]}}
I want all of the links stored under the related key.
How can I get all of the links?
Answer 0 (score: 1)
Try this:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.afterpay.com/en-AU/categories'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
item = soup.select_one("[data-react-class='SharedStateHydrator']")
categories = json.loads(item.get("data-react-props"))['categoriesResponse']['data']
json_data = []
for linklist in categories:
    links = linklist['relationships']
    # iterate over every relationship entry
    for sub_dict in links:
        # 'children' has no 'links' entry, so skip it
        if "children" == sub_dict:
            continue
        # get the related URL
        related_url = links[sub_dict]['links']['related']
        # fetch the JSON response for the related URL
        links[sub_dict]['links']['response_data'] = requests.get(related_url).json()
    json_data.append(links)
print(json_data)
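In the output shown in the question, stores and topStores point at the same URL, so the loop above requests it twice. A minimal sketch of fetching each distinct URL only once follows; fetch_unique and fake_fetch are hypothetical names used for illustration, and the fetch function is passed in so the example runs without network access.

```python
def fetch_unique(relationships, fetch):
    """Call fetch(url) once per distinct 'related' URL and return {url: result}."""
    results = {}
    for entry in relationships.values():
        # Entries such as 'children' have no 'links' key and are skipped.
        url = entry.get("links", {}).get("related")
        if url is not None and url not in results:
            results[url] = fetch(url)
    return results


# Trimmed copy of the output shown in the question.
base = "https://store-directory-api.afterpay.com/api/v1/categories/jewellery/stores"
sample = {
    "stores": {"links": {"related": base + "?locale=en-AU"}},
    "topStores": {"links": {"related": base + "?locale=en-AU"}},
    "featuredStores": {"links": {"related": base + "?featured=true&locale=en-AU"}},
    "children": {"data": [{"type": "categories", "id": "135"}]},
}

requested = []


def fake_fetch(url):
    # Stand-in for requests.get(url).json(); records the call instead of using the network.
    requested.append(url)
    return {"fetched": url}


responses = fetch_unique(sample, fake_fetch)
print(len(requested), "requests made")
```

Against the live site, something like lambda url: requests.get(url, timeout=10).json() could be passed as fetch instead of fake_fetch.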
Answer 1 (score: 1)
Just iterate over the dictionary:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.afterpay.com/en-AU/categories'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
item = soup.select_one("[data-react-class='SharedStateHydrator']")
categories = json.loads(item.get("data-react-props"))['categoriesResponse']['data']
for linklist in categories:
    links = linklist['relationships']
    for key, related in links.items():
        if 'links' in related.keys():
            for key2, link in related.get('links').items():
                print(link)
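The loops above rely on the links sitting exactly two levels down. If the nesting depth is not known in advance, a recursive walk is one alternative. This is only a sketch: iter_related is a hypothetical helper name, and the sample data is a trimmed copy of the output shown in the question.

```python
def iter_related(node):
    """Yield every string stored under a 'related' key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "related" and isinstance(value, str):
                yield value
            else:
                yield from iter_related(value)
    elif isinstance(node, list):
        for element in node:
            yield from iter_related(element)


# Trimmed copy of the output shown in the question.
base = "https://store-directory-api.afterpay.com/api/v1/categories/jewellery/stores"
sample = {
    "stores": {"links": {"related": base + "?locale=en-AU"}},
    "featuredStores": {"links": {"related": base + "?featured=true&locale=en-AU"}},
    "children": {"data": [{"type": "categories", "id": "135"}]},
}

related_links = list(iter_related(sample))
print(related_links)
```

The same generator can be pointed at the whole categories list instead of a single relationships dict, since it descends into lists as well as dicts.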
Answer 2 (score: 1)
The following is very quick (though it is worth confirming that it gives the list you need):
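import re, requests

r = requests.get('https://www.afterpay.com/en-AU/categories')
# Assumption: the JSON sits inside an HTML attribute, so its quotes appear as &quot; in the raw page source
p = re.compile(r'related&quot;:&quot;(.*?)&')
links = p.findall(r.text)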
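One thing to check with the regex route: a pattern that stops at the first & will cut off any URL that carries more than one query parameter. Assuming the raw page source HTML-escapes the JSON (quotes as &quot;, ampersands as &amp;), a sketch that captures up to the closing quote and then unescapes the result might look like this. The raw string below is a made-up fragment on example.com, not the real markup.

```python
import html
import re

# Hypothetical fragment of raw page source; the real markup may differ.
raw = (
    'data-react-props="{&quot;links&quot;:{&quot;related&quot;:'
    '&quot;https://example.com/stores?featured=true&amp;locale=en-AU&quot;}}"'
)

# Capture everything up to the closing escaped quote rather than the first '&'.
pattern = re.compile(r'related&quot;:&quot;(.*?)&quot;')

# html.unescape turns &amp; back into & so the URL is usable as-is.
full_links = [html.unescape(match) for match in pattern.findall(raw)]
print(full_links)
```

If the page turns out not to escape the attribute this way, the pattern would need adjusting, so parsing the attribute with json.loads as in the other answers is the safer default.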