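Extracting text from a JavaScript- or Ajax-based web page?

Asked: 2018-07-25 08:08:37

Tags: python web-scraping

Is there a way to scrape text from a JavaScript-based website, for example: https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown

I need the product specifications from this page. How can I do that?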

2 answers:

Answer 0 (score: 0)

The product details can be extracted easily with Selenium WebDriver:

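from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown')
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
list_product = driver.find_elements(By.XPATH, '//ul[@class="prod-list"]/li')
description_1 = list_product[0].text  # text of the first specification entry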

Similarly, you can extract all the other values.
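Reading "all the other values" as every `<li>` matched by the XPath, a minimal sketch could loop over the elements instead of indexing only the first one. The helper name `collect_texts` is hypothetical (not part of Selenium); it only relies on each element exposing a `.text` attribute, as the elements in `list_product` do.

```python
def collect_texts(elements):
    """Return the stripped, non-empty .text of each element, in order."""
    texts = []
    for element in elements:
        # .text can be empty for elements that are hidden or not yet rendered
        text = (element.text or "").strip()
        if text:
            texts.append(text)
    return texts

# Usage with the answer's variable (needs a live browser session):
# all_descriptions = collect_texts(list_product)
```

Because the page is rendered by JavaScript, the list may still be empty right after `driver.get(...)` returns; waiting for the elements before collecting them is probably needed in practice.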

Answer 1 (score: 0)

No Selenium, just a regular expression.

import re
import json
import requests

from pprint import pprint

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
}

response = requests.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown', headers=headers)
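html = response.text  # decoded text, so the str pattern below also matches on Python 3

# The page embeds its initial state as JSON assigned to window.__PRELOADED_STATE__
regex = r"<script>\s+window\.__PRELOADED_STATE__ =(.*);\s+<\/script>\s+<script\s+id\s+=\s+\"appJs\""

data = re.findall(regex, html, re.MULTILINE | re.DOTALL)[0]
state = json.loads(data)  # avoid rebinding the name `json`, which would shadow the module

details = []

for row in state['product']['productDetails']['featureData']:
    # Some features carry no value; record None for those
    try:
        value = row['featureValues'][0]['value']
    except (KeyError, IndexError):
        value = None
    details.append({'name': row['name'], 'value': value})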

pprint(details)

Result:

[{'name': u'Highlight', 'value': u'Multiple pockets'},
 {'name': u'Hidden Detail', 'value': u'Belt loops'},
 {'name': u'Additional Informations', 'value': u'Zip fly closure'},
 {'name': u'Waist Rise', 'value': u'Mid Rise'},
 {'name': u'Fabric Composition', 'value': u'100% Cotton'},
 {'name': u'Size worn by Model', 'value': u'32'},
 {'name': u'Fit Type', 'value': u'Straight Fit'},
 {'name': u'Size Detail', 'value': u'Fits true to standard size on the model'},
 {'name': u'Wash Care', 'value': u'Machine wash'},
 {'name': u'Model Waist Size', 'value': u'32"'},
 {'name': u'Model Height', 'value': u"6'"},
 {'name': u'Size Format', 'value': None}]
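The regex in this answer depends on the exact whitespace around the assignment and on the `<script id = "appJs">` tag that follows it, so it may stop matching if the markup shifts slightly. As one possible alternative, here is a sketch that locates the `window.__PRELOADED_STATE__ =` marker and lets `json.JSONDecoder.raw_decode` find where the JSON object ends. The function names and `SAMPLE_HTML` are hypothetical; the sample is a made-up stand-in for the downloaded page, not real ajio.com markup, so the snippet can be tried offline.

```python
import json

MARKER = "window.__PRELOADED_STATE__ ="

def extract_preloaded_state(html):
    """Decode the JSON object assigned to window.__PRELOADED_STATE__ in html."""
    start = html.index(MARKER) + len(MARKER)  # raises ValueError if the marker is absent
    # raw_decode does not skip leading whitespace, so step past it manually
    while html[start].isspace():
        start += 1
    state, _end = json.JSONDecoder().raw_decode(html, start)
    return state

def feature_details(state):
    """Build the same name/value list as the loop in the answer."""
    details = []
    for row in state['product']['productDetails']['featureData']:
        values = row.get('featureValues') or [{}]
        details.append({'name': row['name'], 'value': values[0].get('value')})
    return details

# Hypothetical stand-in for the page source (assumed shape, for illustration only)
SAMPLE_HTML = """
<script>
  window.__PRELOADED_STATE__ = {"product": {"productDetails": {"featureData": [
    {"name": "Wash Care", "featureValues": [{"value": "Machine wash"}]},
    {"name": "Size Format"}
  ]}}};
</script>
<script id = "appJs" src="app.js"></script>
"""

details = feature_details(extract_preloaded_state(SAMPLE_HTML))
```

With the real page, `response.text` from the answer's `requests.get(...)` call would take the place of `SAMPLE_HTML`, assuming the site still embeds its state this way.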