I'm trying to scrape data from this particular website, but the hard part is that the details I need live inside a script tag. I've tried BeautifulSoup and Selenium, but neither gave me what I need.
http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy
The output should be:
Product Details Burgundy woven blouson top with gathers, has a round neck, sleeveless, criss-cross detail on the back Material & Care Crepe Hand-wash cold
This is the code I'm trying:
from bs4 import BeautifulSoup
import urllib.request  # urllib.urlopen is Python 2; Python 3 uses urllib.request

x = urllib.request.urlopen("http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy")
soup2 = BeautifulSoup(x, 'html.parser')
# the keyword argument is 'attrs', not 'atttrs'
for i in soup2.find_all('p', attrs={'class': 'pdp-product-description-content'}):
    print(i.text)
Answer 0 (score: 0)
You're better off using a regex and JSON. First extract the script variable from the page source with a regular expression, then load the extracted text into a JSON object, and finally read the data stored under the productDescriptors key in the resulting dict.
import re
import json
import urllib.request
from bs4 import BeautifulSoup

x = urllib.request.urlopen("http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy")
soup2 = BeautifulSoup(x, 'html.parser')
# capture the script payload that starts with {"pdpData" and runs up to </script>
product_data = re.findall('({"pdpData".+?)</script>', re.sub('\n', ' ', soup2.prettify()))
product_description = json.loads(product_data[0])['pdpData']['productDescriptors']
print(product_description)
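To see the extraction mechanics without hitting the live site, here is a minimal, self-contained sketch of the same regex-then-JSON step. The HTML below is a hypothetical stand-in for Myntra's embedded script; the key structure under "pdpData" is assumed to match what the answer above reads, and the nested "description"/"value" keys are illustrative:

```python
import re
import json

# Hypothetical sample of the embedded script; the real "pdpData" blob is
# far larger, but the top-level keys assumed here mirror the answer above.
html = ('<script>window.__myx = {"pdpData": {"productDescriptors": '
        '{"description": {"value": "Burgundy woven blouson top"}}}}</script>')

# Capture from the opening {"pdpData" up to the closing </script>
product_data = re.findall('({"pdpData".+?)</script>', html.replace('\n', ' '))

# The captured text is valid JSON, so it can be parsed directly
data = json.loads(product_data[0])
print(data['pdpData']['productDescriptors']['description']['value'])
```

The non-greedy `.+?` matters: with a greedy `.+` the match would run past the first `</script>` if the page contains several script tags.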
Answer 1 (score: 0)
One way to do it:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy'
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
req = driver.page_source
soup = BeautifulSoup(req, 'html.parser')

# first title/body pair: "Product Details"
producttitle = soup.find('h6', {'class': 'pdp-product-description-title'})
productbody = soup.find('p', {'class': 'pdp-product-description-content'})
print(producttitle.text, productbody.text)

# remaining section titles ("Material & Care", ...)
matc = soup.find_all('h6', {'class': 'pdp-product-description-title'})[1:]
for x in matc:
    print(x.text)

# remaining section bodies
matd = soup.find_all('p', {'class': 'pdp-product-description-content'})[1:]
for y in matd:
    print(y.text)
This prints:
Product Details Burgundy woven blouson top with gathers, has a round neck, sleeveless, criss-cross detail on the back
Material & Care
CrepeHand-wash cold
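The two separate loops above print all titles and then all bodies; zipping the matched lists keeps each section title next to its own body. A minimal sketch on a hard-coded snippet (the class names are the ones the answer scrapes; the HTML itself is a stand-in for the live page):

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the title/body pairs found on the product page
html = '''
<h6 class="pdp-product-description-title">Product Details</h6>
<p class="pdp-product-description-content">Burgundy woven blouson top</p>
<h6 class="pdp-product-description-title">Material &amp; Care</h6>
<p class="pdp-product-description-content">Crepe Hand-wash cold</p>
'''

soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('h6', {'class': 'pdp-product-description-title'})
bodies = soup.find_all('p', {'class': 'pdp-product-description-content'})

# Pair each section title with its body instead of printing them separately
for title, body in zip(titles, bodies):
    print(title.text, body.text)
```

`zip` silently drops unmatched items, so if the page ever has a title with no body (or vice versa) the shorter list wins; that trade-off is acceptable here because the titles and bodies come in pairs.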