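I am trying to scrape this page.
I have 2 problems:
1) I am trying to get data from the table in the Package details tab, but I get nothing. I believe my selector path is correct, yet there is no output. The required output is shown below.
2) Although I get the "src" text of the images, I do not get the required text for the images. The required output is shown below.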
import requests
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for record in page.find_all('.packages-specification-table tr:last-child'):
    for data in record.find_all('td'):
        print(data.text)
for record1 in page.find_all('.packages-specification-table tr:first-child'):
    for data1 in record1.find_all('th'):
        print(data1)
for des in page.find_all('img'):
    image = des.get('src')
    print(image)
Required table output:
Article Number 00361049
Packages 1
Width 74 cm
Height 48 cm
Length 106 cm
Diameter -
Weight 30.00 kg
Required image output (src):
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG
Answer 0 (score: 0)
This page uses JavaScript to load its data.
The code below gets the image URLs.
import requests
url = 'https://www.ikea.com/sa/en/iows/catalog/products/?catalog=departments&category=10687&type=json&dataset=small,allImages,prices&count=11&sort=relevance&sortorder=ascending&startIndex=0'
r = requests.get(url)
data = r.json()
for item in data['products']:
    print(item['item']['name'])
    for image in item['item']['images']['large']:
        print(image)
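The loop above raises a KeyError as soon as one product has no images entry. Here is a minimal sketch of a more defensive version. The SAMPLE payload and the iter_large_images helper are made up for illustration; they only mirror the keys the code above reads (products, item, name, images, large), so the real response may look different.

```python
# Hypothetical payload shaped like the keys the answer's loop reads.
SAMPLE = {
    "products": [
        {"item": {"name": "SAMPLE CHAIR",
                  "images": {"large": ["/PIAimages/sample_1.JPG",
                                       "/PIAimages/sample_2.JPG"]}}},
        {"item": {"name": "NO IMAGES"}},  # product without an "images" key
    ]
}


def iter_large_images(data):
    """Yield (product name, image URL) pairs, skipping missing keys."""
    for product in data.get("products", []):
        item = product.get("item", {})
        name = item.get("name")
        for url in item.get("images", {}).get("large", []):
            yield name, url


for name, url in iter_large_images(SAMPLE):
    print(name, url)
```

With a live response you would pass r.json() instead of SAMPLE.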
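Other information may be in other files loaded by JavaScript.
You can find them in DevTools in Chrome/Firefox (tab: Network, filter: XHR).
EDIT:
This page uses JavaScript, but BeautifulSoup does not run JavaScript.
When I turn off JavaScript in the web browser, I see the elements in different tags than the ones your code looks for.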
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
soup = BeautifulSoup(r.text, "html.parser")
html = soup.select('div#productDimensionsContainer div#metric')[0].encode_contents().decode().strip()
data = list(filter(None, html.split('<br/>')))
print(data)
# ['Width: 82 cm', 'Depth: 96 cm', 'Height: 101 cm', 'Seat width: 49 cm', 'Seat depth: 54 cm', 'Seat height: 45 cm']
html = soup.select('div#custMaterials')[0].encode_contents().decode().strip()
data = list(filter(None, html.split('<br/>')))
print(data)
# ['Total composition: 100% polyester', 'Frame: Solid wood, Plywood, Particleboard, Polyurethane foam 25 kg/cu.m., Polyurethane foam 35 kg/cu.m., Polyester wadding', 'Seat cushion: Polyurethane foam 35 kg/cu.m., Polyester wadding', 'Leg: Solid beech, Clear lacquer']
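Independently of the JavaScript issue, find_all() matches tag names, not CSS selectors, so a call such as find_all('.packages-specification-table tr:last-child') returns nothing even when the table is present in the HTML. select() is the method that accepts CSS selectors. Below is a small sketch on a hand-written HTML fragment. The markup is an assumption, not the real IKEA page, and the pseudo-classes assume a reasonably recent Beautiful Soup (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; not copied from the real page.
SAMPLE_HTML = """
<table class="packages-specification-table">
<tr><th>Article Number</th><th>Packages</th><th>Width</th></tr>
<tr><td>00361049</td><td>1</td><td>74 cm</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() treats the string as a tag name, so nothing matches.
no_match = soup.find_all(".packages-specification-table tr:last-child")

# select() interprets the same string as a CSS selector.
headers = [th.get_text(strip=True)
           for th in soup.select(".packages-specification-table tr:first-child th")]
values = [td.get_text(strip=True)
          for td in soup.select(".packages-specification-table tr:last-child td")]
table = dict(zip(headers, values))

print(len(no_match))
print(table)
```

On the real page this only helps if the table is already in the HTML that requests receives, which is why the JavaScript point above still matters.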
EDIT:
There is also a <script> with var jProductData=..., and it contains the information shown in the table.
import requests
from bs4 import BeautifulSoup
import json
r = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
soup = BeautifulSoup(r.text, "html.parser")
# var jProductData = {"product":{"items": ... }};
all_scripts = soup.select('script')
for script in all_scripts:
    script = script.encode_contents().decode().strip()
    if 'var jProductData' in script:
        for row in script.split('\n'):
            if 'var jProductData' in row:
                # drop the 19-character prefix 'var jProductData = ' and the trailing ';'
                data = json.loads(row.strip()[19:-1])
                for item in data['product']['items']:
                    #print(item['pkgInfoArr'][0])
                    print('articleNumber:', item['pkgInfoArr'][0]['articleNumber'])
                    print('weightMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['weightMet'])
                    print('widthMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['widthMet'])
                    print('quantity:', item['pkgInfoArr'][0]['pkgInfo'][0]['quantity'])
                    print('consumerPackNo:', item['pkgInfoArr'][0]['pkgInfo'][0]['consumerPackNo'])
                    print('lengthMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['lengthMet'])
                    print('heightMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['heightMet'])
                    print('---')
Result:
articleNumber: 20343224
weightMet: 30.40 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 48 cm
---
articleNumber: 00361049
weightMet: 30.00 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 48 cm
---
articleNumber: 90361894
weightMet: 29.70 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
articleNumber: 80359844
weightMet: 30.00 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 53 cm
---
articleNumber: 40359855
weightMet: 31.00 kg
widthMet: 75 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 49 cm
---
articleNumber: 10413953
weightMet: 29.90 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
articleNumber: 40433247
weightMet: 29.90 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
There may be other information in there, such as the image URLs, but I did not dig deeper into var jProductData to find it.
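The fixed slice [19:-1] depends on the exact spacing of the assignment and on the whole object sitting on one line. A slightly more tolerant sketch locates the variable with a regular expression and lets json.JSONDecoder.raw_decode parse a single JSON value from that position, ignoring whatever follows it. The SAMPLE_SCRIPT text and the extract_js_object helper are illustrative assumptions, and this only works when the assigned value is valid JSON.

```python
import json
import re

# Made-up script text that imitates the shape of the real <script> content.
SAMPLE_SCRIPT = """
var otherVar = 1;
var jProductData   = {"product": {"items": [{"pkgInfoArr": [
    {"articleNumber": "00361049",
     "pkgInfo": [{"weightMet": "30.00 kg", "widthMet": "74 cm"}]}]}]}};
var tail = "x";
"""


def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to `var <var_name> = ...`, or None."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # returns it together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


data = extract_js_object(SAMPLE_SCRIPT, "jProductData")
for item in data["product"]["items"]:
    pkg = item["pkgInfoArr"][0]
    print("articleNumber:", pkg["articleNumber"])
    print("weightMet:", pkg["pkgInfo"][0]["weightMet"])
```

In the code above you would call extract_js_object(script, "jProductData") on each script's text instead of splitting it into rows.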