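I am trying to scrape some attributes from an e-commerce website, but the data is not stored in the HTML. It is stored in a JavaScript script tag instead.
I am trying to get product, brand, and productId from that script tag.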
import requests
from bs4 import BeautifulSoup
base_url = "https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar"
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
scripts = soup.find_all('script')[8]
scripts
Answer 0 (score: 3)
import requests
from bs4 import BeautifulSoup
import json
import pyjsparser
r = requests.get(
"https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar&p=1")
soup = BeautifulSoup(r.text, 'html.parser')
script = soup.findAll("script")[8].text
tree = pyjsparser.parse(script)
print(tree.keys())
Answer 1 (score: 2)
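You can get the script as text and remove window.__myx = from the beginning. What remains is valid JSON, which the standard json module can convert to a Python dictionary.
Then you can use the dictionary keys and a for loop to get the information.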
import requests
from bs4 import BeautifulSoup
import json
base_url = "https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar"
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
# get .text
scripts = soup.find_all('script')[8].text
# remove window.__myx =
script = scripts.split('=', 1)[1]
# convert to dictionary
data = json.loads(script)
for item in data['searchData']['results']['products']:
    print('product:', item['product'])
    print('productId:', item['productId'])
    print('brand:', item['brand'])
    print('---')
Result:
product: Louis Philippe Men White & Blue Slim Fit Checked Formal Shirt
productId: 11390900
brand: Louis Philippe
---
product: Hancock Men White Slim Fit Solid Formal Shirt
productId: 7460073
brand: Hancock
---
product: INVICTUS Men Navy Slim Fit Printed Semiformal Shirt
productId: 6970620
brand: INVICTUS
---
product: next Men White Slim Fit Solid Formal Shirt
productId: 11067410
brand: next
---
product: INVICTUS Men White & Green Slim Fit Printed Semiformal Shirt
productId: 2314014
brand: INVICTUS
---
product: Dazzio Men Black Modern Slim Fit Solid Formal Shirt
productId: 3009355
brand: Dazzio
---
etc.
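
Both answers pick the script by position (index 8), which may stop working if the page adds or removes a script tag. A minimal sketch of a less position-dependent variant is to look for the script whose text starts with window.__myx and strip the assignment from it. The names extract_myx, PREFIX, and sample_scripts are hypothetical, the sample data is made up, and the handling of a trailing semicolon is an assumption about how the page might write the assignment.

```python
import json

PREFIX = "window.__myx"  # variable name taken from the answer above


def extract_myx(script_texts):
    """Return the parsed JSON from the first script assigning window.__myx, or None."""
    for text in script_texts:
        text = text.strip()
        if not text.startswith(PREFIX):
            continue
        # Keep everything after the first '=' and drop a trailing ';' if present
        # (assumption: the page may or may not end the statement with one).
        payload = text.split("=", 1)[1].strip().rstrip(";")
        return json.loads(payload)
    return None


# Made-up sample standing in for the text of the page's <script> tags.
sample_scripts = [
    "var unrelated = 1;",
    'window.__myx = {"searchData": {"results": {"products": '
    '[{"product": "Sample Shirt", "productId": 1, "brand": "SampleBrand"}]}}};',
]

data = extract_myx(sample_scripts)
for item in data["searchData"]["results"]["products"]:
    print(item["productId"], item["brand"], item["product"])
```

With the code from the answer, the input could be built as (s.text for s in soup.find_all('script')). The function returns None when no matching script is found, so the caller can tell a changed page layout apart from a JSON parsing error.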