How to get the text inside a <script> tag

Time: 2019-08-27 07:29:52

Tags: python html selenium web-scraping beautifulsoup

I am scraping the LaneBryant website.

Part of the page source is:

<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Flip Sequin Teach & Inspire Graphic Tee",
"image": [
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
],
"description": "Get inspired with [...]",
"brand": "Lane Bryant",
"sku": "356861",
"offers": {
"@type": "Offer",
"url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
"priceCurrency": "USD",
"price":"44.95",
"availability": "http://schema.org/InStock",
"itemCondition": "https://schema.org/NewCondition"
}
}
}
}
</script>

To get the price in USD, I wrote the following script:

def getPrice(self, start):
    fprice = []
    discount = ""

    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    #print("price 1 is + "+ str(price1)+"data is "+str(data))
    price1 = str(price1).split(",")
    #price1=str(price1).split(":")
    print("final price +" + str(price1[11]))

where start is created with:

        d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
        d.get(url)
        start = BeautifulSoup(d.page_source, 'html.parser')

It does not print the price, even though I pass in what looks like the right text. How can I get just the price?

2 answers:

Answer 0: (score: 1)

In this case, you can simply use a regular expression to extract the price:

import requests, re

r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
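The pattern above matches only when there is no whitespace between "price": and its value, and indexing [0] raises an IndexError when nothing matches. A slightly more tolerant variant might look like the sketch below. The sample strings are hypothetical stand-ins for r.text, and find_price is an illustrative helper name, not part of any library.

```python
import re

# Hypothetical fragments of page text (stand-ins for r.text);
# the second one has whitespace around the colon.
samples = [
    '"priceCurrency": "USD","price":"44.95","availability": "..."',
    '"priceCurrency": "USD", "price" : "44.95"',
]

# \s* tolerates optional whitespace around the colon;
# [^"]* stops at the closing quote of the value.
PRICE_RE = re.compile(r'"price"\s*:\s*"([^"]*)"')

def find_price(text):
    """Return the first quoted price value in text, or None if absent."""
    match = PRICE_RE.search(text)
    return match.group(1) if match else None

for sample in samples:
    print(find_price(sample))  # prints 44.95 for both samples
```

Because the pattern requires a closing quote directly after price, it does not match the "priceCurrency" key.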

Otherwise, target the appropriate script tag by its id and parse its .text with the json library:

import requests, json
from bs4 import BeautifulSoup 

r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)

Answer 1: (score: 0)

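price1 = start.find('script', {'type': 'application/ld+json'}) actually returns the <script> tag, so a better name would be

script_tag = start.find('script', {'type': 'application/ld+json'})

You can access the text inside the script tag with .text. In this case, that gives you the JSON:

json_string = script_tag.text

Instead of splitting on commas, use a JSON parser such as json.loads from the standard library, to avoid misinterpreting the data.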
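The JSON-parsing step could be sketched roughly as follows. This assumes the JSON on the live page is well-formed and shaped like the snippet in the question (the quoted snippet ends with extra closing braces, presumably from trimming, so they are left out here). The json_string literal is a trimmed stand-in for script_tag.text, and extract_price is an illustrative helper name.

```python
import json

# Stand-in for script_tag.text (assumption: trimmed from the question's
# snippet, without the extra closing braces).
json_string = """
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Flip Sequin Teach & Inspire Graphic Tee",
  "description": "Get inspired with [...]",
  "sku": "356861",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "44.95"
  }
}
"""

def extract_price(raw):
    """Parse a JSON-LD product string and return (price, currency)."""
    data = json.loads(raw)
    offers = data["offers"]
    return offers["price"], offers["priceCurrency"]

price, currency = extract_price(json_string)
print(price, currency)  # prints: 44.95 USD
```

Looking the value up by key does not depend on the order of the fields or on commas inside strings such as the description, which is where a fixed index into a comma-split list can go wrong.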