I'm trying to scrape data from this particular website, but the hard part is that the details I need live inside a script tag. I've tried BeautifulSoup and Selenium, but neither gave me what I need.
http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy
The output should be:
Product Details Burgundy woven blouson top with gathers, has a round neck, sleeveless, criss-cross detail on the back Material & Care Crepe Hand-wash cold
This is the code I'm trying:
from bs4 import BeautifulSoup
import urllib.request  # urllib.urlopen is Python 2; Python 3 uses urllib.request

x = urllib.request.urlopen("http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy")
soup2 = BeautifulSoup(x, 'html.parser')
# the keyword argument is 'attrs', not 'atttrs'
for i in soup2.find_all('p', attrs={'class': 'pdp-product-description-content'}):
    print(i.text)
Answer 0 (score: 0)
You're better off using a regex and JSON. First extract the script variable from the page source with a regular expression, then load the extracted text into a JSON object, and finally read the data stored under the productDescriptors key in the resulting dict.
import re
import json
import urllib.request
from bs4 import BeautifulSoup

x = urllib.request.urlopen("http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy")
soup2 = BeautifulSoup(x, 'html.parser')
# capture the script payload that starts with {"pdpData" and runs up to </script>
product_data = re.findall('({"pdpData".+?)</script>', re.sub('\n', ' ', soup2.prettify()))
product_description = json.loads(product_data[0])['pdpData']['productDescriptors']
print(product_description)
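To see the extraction mechanics without hitting the live site, here is a minimal, self-contained sketch of the same regex-then-JSON step. The HTML below is a hypothetical stand-in for Myntra's embedded script; the key structure under "pdpData" is assumed to match what the answer above reads, and the nested "description"/"value" keys are illustrative:

```python
import re
import json

# Hypothetical sample of the embedded script; the real "pdpData" blob is
# far larger, but the top-level keys assumed here mirror the answer above.
html = ('<script>window.__myx = {"pdpData": {"productDescriptors": '
        '{"description": {"value": "Burgundy woven blouson top"}}}}</script>')

# Capture from the opening {"pdpData" up to the closing </script>
product_data = re.findall('({"pdpData".+?)</script>', html.replace('\n', ' '))

# The captured text is valid JSON, so it can be parsed directly
data = json.loads(product_data[0])
print(data['pdpData']['productDescriptors']['description']['value'])
```

The non-greedy `.+?` matters: with a greedy `.+` the match would run past the first `</script>` if the page contains several script tags.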
Answer 1 (score: 0)
One way to do it:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.myntra.com/tops/rare/rare-burgundy-crepe-blouson-top/1437335/buy'
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
req = driver.page_source
soup = BeautifulSoup(req, 'html.parser')

# first title/body pair: "Product Details"
producttitle = soup.find('h6', {'class': 'pdp-product-description-title'})
productbody = soup.find('p', {'class': 'pdp-product-description-content'})
print(producttitle.text, productbody.text)

# remaining section titles ("Material & Care", ...)
matc = soup.find_all('h6', {'class': 'pdp-product-description-title'})[1:]
for x in matc:
    print(x.text)

# remaining section bodies
matd = soup.find_all('p', {'class': 'pdp-product-description-content'})[1:]
for y in matd:
    print(y.text)
This prints:
Product Details Burgundy woven blouson top with gathers, has a round neck, sleeveless, criss-cross detail on the back
Material & Care
CrepeHand-wash cold
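The two separate loops above print all titles and then all bodies; zipping the matched lists keeps each section title next to its own body. A minimal sketch on a hard-coded snippet (the class names are the ones the answer scrapes; the HTML itself is a stand-in for the live page):

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the title/body pairs found on the product page
html = '''
<h6 class="pdp-product-description-title">Product Details</h6>
<p class="pdp-product-description-content">Burgundy woven blouson top</p>
<h6 class="pdp-product-description-title">Material &amp; Care</h6>
<p class="pdp-product-description-content">Crepe Hand-wash cold</p>
'''

soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('h6', {'class': 'pdp-product-description-title'})
bodies = soup.find_all('p', {'class': 'pdp-product-description-content'})

# Pair each section title with its body instead of printing them separately
for title, body in zip(titles, bodies):
    print(title.text, body.text)
```

`zip` silently drops unmatched items, so if the page ever has a title with no body (or vice versa) the shorter list wins; that trade-off is acceptable here because the titles and bodies come in pairs.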