Scraping text from a span tag that contains nested span tags with BeautifulSoup

Time: 2020-10-10 18:28:27

Tags: python web-scraping beautifulsoup

I have searched Google extensively but could not find code that solves this problem.

How can I use Python's BeautifulSoup library to extract the price (35,916.00 in the snippet below) from the following HTML?

<span style="text-decoration: inherit; white-space: nowrap;">
<span class="currencyINR">
&nbsp;&nbsp;
</span>
<span class="currencyINRFallback" style="display:none">
Rs. 
</span>
35,916.00
</span>

The HTML above is part of the following page: https://www.amazon.in/gp/offer-listing/B01671J2I6/ref=dp_olp_afts?ie=UTF8&condition=all&qid=1602348797&sr=1-19

I tried the following code:

import requests
from bs4 import BeautifulSoup

URL = "https://www.amazon.in/gp/offer-listing/B01671J2I6/ref=dp_olp_afts?ie=UTF8&condition=all&qid=1602348797&sr=1-19"

HEADER = {'User-Agent' : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.6"}

page = requests.get(URL, headers=HEADER)
soup = BeautifulSoup(page.content, "html.parser")
price = soup.find("span", {"style" : "text-decoration: inherit; white-space: nowrap;"}).getText()
print(price)

It gives me:

AttributeError: 'NoneType' object has no attribute 'getText'

1 answer:

Answer 0 (score: 0)

For the URL given in your question, this is how you can get the prices:

import requests
from bs4 import BeautifulSoup

URL = "https://www.amazon.in/gp/offer-listing/B01671J2I6/ref=dp_olp_afts?ie=UTF8&condition=all&qid=1602348797&sr=1-19/"

HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.6",
}

page = requests.get(URL, headers=HEADER)
soup = BeautifulSoup(page.content, "html5lib")
price_spans = soup.find_all("span", {"style": "text-decoration: inherit; white-space: nowrap;"})
print([p.getText(strip=True) for p in price_spans])

Output: ['Rs.35,916.00', 'Rs.35,916.00', 'Rs.45,000.00']
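
The strings in that output still carry the "Rs." prefix and thousands separators. If you need numbers you can compare or sort, something like the following might work. This is only a sketch: `parse_price` is a hypothetical helper name, not part of BeautifulSoup, and it assumes every price looks like the three strings shown above (digits, optional commas, optional decimal part).

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number found in `text` as a Decimal, or None.

    Hypothetical helper: assumes prices look like 'Rs.35,916.00'.
    """
    # A digit, then any mix of digits and commas, then an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


# The list printed by the answer's code.
scraped = ['Rs.35,916.00', 'Rs.35,916.00', 'Rs.45,000.00']
prices = [parse_price(s) for s in scraped]
print(min(prices))  # prints 35916.00
```

`Decimal` is used instead of `float` so the two decimal places of the price are kept exactly.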

Note: I changed the HTML parser, so you may need to run pip install html5lib first.
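
If you only want the number without the "Rs." text from the nested spans, one option is to collect just the text nodes that are direct children of the outer span. The sketch below runs against the HTML snippet from the question rather than the live page, uses the built-in "html.parser" so it needs no extra install, and `own_text` is a hypothetical helper name. It also checks for None, since `find()` returns None when nothing matches, which is what produced the AttributeError in the question.

```python
from bs4 import BeautifulSoup

# HTML snippet copied from the question (not fetched from the live page).
HTML = """
<span style="text-decoration: inherit; white-space: nowrap;">
<span class="currencyINR">
&nbsp;&nbsp;
</span>
<span class="currencyINRFallback" style="display:none">
Rs.
</span>
35,916.00
</span>
"""


def own_text(tag):
    """Return the text that belongs directly to `tag`, skipping nested tags."""
    # string=True keeps only text nodes; recursive=False limits the search
    # to direct children, so text inside the nested spans is ignored.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("span", {"style": "text-decoration: inherit; white-space: nowrap;"})

# Guard against find() returning None before calling anything on the result.
price = own_text(outer) if outer is not None else None
print(price)  # prints 35,916.00
```

On the real page the same helper could be applied to each element returned by `find_all`, assuming the page is served with the same markup as the snippet.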