Is there a way to scrape text from a JavaScript-based website, for example: https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown
I need the product specifications from this page. How can I do that?
Answer 0 (score: 0)
Product details can be extracted easily with Selenium WebDriver:
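from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown')
# Each <li> under the "prod-list" <ul> holds one product detail.
list_product = driver.find_elements(By.XPATH, '//ul[@class="prod-list"]/li')
description_1 = list_product[0].text  # text of the first detail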
Similarly, you can extract all the other values.
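One way to collect every value at once is sketched below. It assumes the list items only appear after the page's JavaScript has run, so it waits for them explicitly. The function names `clean_texts` and `fetch_spec_texts` and the 10-second timeout are illustrative choices, not part of the answer. The XPath is the one used above. The demo at the bottom uses stand-in objects instead of a live browser.

```python
from types import SimpleNamespace


def clean_texts(elements):
    """Return the stripped, non-empty .text of each element-like object."""
    texts = (el.text.strip() for el in elements)
    return [t for t in texts if t]


def fetch_spec_texts(url, xpath='//ul[@class="prod-list"]/li', timeout=10):
    """Open the page in Chrome and return the text of every matching <li>."""
    # Selenium is imported here so clean_texts() stays usable without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element is present in the DOM.
        items = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, xpath))
        )
        return clean_texts(items)
    finally:
        driver.quit()  # always close the browser, even on timeout


# Stand-ins with a .text attribute, used here in place of real WebElements.
demo_items = [
    SimpleNamespace(text=' Multiple pockets '),
    SimpleNamespace(text=''),
    SimpleNamespace(text='Belt loops'),
]
demo_texts = clean_texts(demo_items)
```

Calling `fetch_spec_texts(...)` with the product URL would need Chrome and a matching driver available locally. Whether the XPath still matches depends on the site's current markup.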
Answer 1 (score: 0)
This works without Selenium, using only a regular expression.
import re
import json
import requests
from pprint import pprint

# Browser-like request headers.
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'DNT': '1',
    # No 'br': requests can only decode Brotli when a brotli package is installed.
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
}
response = requests.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown', headers=headers)
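html = response.text  # decoded text (str), matching the str pattern below
# The page embeds its data as JSON assigned to window.__PRELOADED_STATE__,
# immediately followed by the <script id = "appJs"> tag.
regex = r'<script>\s+window\.__PRELOADED_STATE__ =(.*);\s+</script>\s+<script\s+id\s+=\s+"appJs"'
data = re.findall(regex, html, re.MULTILINE | re.DOTALL)[0]
state = json.loads(data)  # parsed page state as a dict
details = []
for row in state['product']['productDetails']['featureData']:
    try:
        value = row['featureValues'][0]['value']
    except (KeyError, IndexError):
        # Some features carry no value.
        value = None
    details.append({'name': row['name'], 'value': value})
pprint(details)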
Result (as originally printed under Python 2; Python 3 shows the same values without the u prefixes):
[{'name': u'Highlight', 'value': u'Multiple pockets'},
{'name': u'Hidden Detail', 'value': u'Belt loops'},
{'name': u'Additional Informations', 'value': u'Zip fly closure'},
{'name': u'Waist Rise', 'value': u'Mid Rise'},
{'name': u'Fabric Composition', 'value': u'100% Cotton'},
{'name': u'Size worn by Model', 'value': u'32'},
{'name': u'Fit Type', 'value': u'Straight Fit'},
{'name': u'Size Detail', 'value': u'Fits true to standard size on the model'},
{'name': u'Wash Care', 'value': u'Machine wash'},
{'name': u'Model Waist Size', 'value': u'32"'},
{'name': u'Model Height', 'value': u"6'"},
{'name': u'Size Format', 'value': None}]
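The pattern in the answer above depends on the exact whitespace and on the `appJs` script tag that follows the data. A somewhat more tolerant variant, sketched here as an assumption rather than as the answer's method, locates only the assignment and lets `json.JSONDecoder.raw_decode` read a single JSON value from that position. `SAMPLE_HTML` is a hypothetical, heavily shortened stand-in for the page source, and the helper names `extract_preloaded_state` and `feature_table` are illustrative. The key names follow the ones used in the answer.

```python
import json
import re

# Hypothetical stand-in for the real page source (not fetched from the site).
SAMPLE_HTML = """
<script>
    window.__PRELOADED_STATE__ = {"product": {"productDetails": {"featureData": [
        {"name": "Highlight", "featureValues": [{"value": "Multiple pockets"}]},
        {"name": "Size Format"}
    ]}}};
</script>
<script id = "appJs" src="app.js"></script>
"""


def extract_preloaded_state(html):
    """Parse the JSON object assigned to window.__PRELOADED_STATE__."""
    match = re.search(r"window\.__PRELOADED_STATE__\s*=\s*", html)
    if match is None:
        raise ValueError("window.__PRELOADED_STATE__ not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the ";" and the closing tag).
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def feature_table(state):
    """Map each feature name to its first value, or None if it has none."""
    table = {}
    for row in state["product"]["productDetails"]["featureData"]:
        values = row.get("featureValues") or []
        table[row["name"]] = values[0].get("value") if values else None
    return table


features = feature_table(extract_preloaded_state(SAMPLE_HTML))
```

With the real page, `html` would be `response.text` from the request above. Whether the site still embeds its data this way would need to be checked against the current page source.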