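I need to extract text from a few HTML fragments (in the example below, "325" and "550"). How can I do this with Python 3.6.0, bs4, and urllib? I will then write the extracted data to a CSV file.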
<div class="a-row a-spacing-none">
<a class="a-link-normal a-text-normal" href="https://www.amazon.in/Game-Thrones-Song-Ice-Fire/dp/0007428545">
<span class="a-size-small a-color-secondary">
</span>
<span class="a-size-base a-color-price s-price a-text-bold">
<span class="currencyINR">
</span>
325
</span>
</a>
<span class="a-letter-space">
</span>
<span aria-label='Suggested Retail Price: <span class="currencyINR">&nbsp;&nbsp;</span>550' class="a-size-small a-color-secondary a-text-strike">
<span class="currencyINR">
</span>
550
</span>
</div>
I tried the following code, but I could not get rid of the enclosing span tags:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=a+song+of+ice+and+fire'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div", {"class":"s-item-container"})
contain = containers[0]
price = contain.findAll("span", {"class":"a-size-base a-color-price s-price a-text-bold"})
current_price = price[0].text.strip()
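Answer 0 (score: 0)
For starters, you can select the span elements with class currencyINR. The price is the text node that immediately follows such a span.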
currency = contain.find('span', attrs={"class":"currencyINR"})
price = currency.nextSibling.strip()
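The same idea can be tried offline against a trimmed copy of the fragment from the question. This is only a minimal sketch: the `html` string is a shortened stand-in for the real page (the `href` is replaced by a placeholder), and the `prices` list is a name introduced here for illustration. It walks every `currencyINR` span and keeps the text node that follows it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question (placeholder href).
html = """
<div class="a-row a-spacing-none">
  <a class="a-link-normal a-text-normal" href="#">
    <span class="a-size-base a-color-price s-price a-text-bold">
      <span class="currencyINR"></span>
      325
    </span>
  </a>
  <span class="a-size-small a-color-secondary a-text-strike">
    <span class="currencyINR"></span>
    550
  </span>
</div>
"""

page = BeautifulSoup(html, "html.parser")

prices = []
for currency in page.find_all("span", class_="currencyINR"):
    sibling = currency.next_sibling
    # The sibling is a text node (a str subclass) when the price follows the span.
    if isinstance(sibling, str):
        text = sibling.strip()
        if text:
            prices.append(text)

print(prices)
```

On the live page the markup may differ from this fragment, so it is worth checking that `next_sibling` is really a text node before calling `strip()` on it, as done above.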
Answer 1 (score: -1)
I solved this later. Apparently the navigation is not as difficult as I had assumed. Anyway, here is the working solution.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=a+song+of+ice+and+fire"
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div", {"class":"s-item-container"})
# Creates New File:
fileName = r"H:\WEBSCRAPER\Result\Products.csv"  # raw string so the backslashes are not treated as escapes
headers = "Product Name, Current Price, Original Price\n"
f = open(fileName, "w")
f.write(headers)
errorMsg = "Error! Not Found"
# obtains the data
for contain in containers:
    try:
        title = contain.h2.text
    except AttributeError:
        # contain.h2 is None when the container has no <h2>
        title = errorMsg
    try:
        priceCurrent = contain.findAll("span", {"class":"a-size-base a-color-price s-price a-text-bold"})
        CurrentSP = priceCurrent[0].text.strip()
    except IndexError:
        # findAll returned an empty list
        CurrentSP = errorMsg
    try:
        priceSuggested = contain.findAll("span", {"class":"a-size-small a-color-secondary a-text-strike"})
        SuggestedSP = priceSuggested[0].text.strip()
    except IndexError:
        SuggestedSP = errorMsg
    print("title: " + title)
    print("CurrentSP: " + CurrentSP)
    print("SuggestedSP: " + SuggestedSP)
    # commas would break the hand-built CSV row, so replace or drop them
    f.write(title.replace(",", "|") + "," + CurrentSP.replace(",", "") + "," + SuggestedSP.replace(",", "") + "\n")
f.close()
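As a possible alternative to replacing commas by hand, the standard library's `csv` module quotes fields that contain commas. The sketch below is not part of the solution above: `write_products` is a hypothetical helper, the rows are made-up sample data, and an in-memory buffer stands in for the real file so the example runs anywhere.

```python
import csv
import io

# Made-up sample rows standing in for the scraped values.
rows = [
    ("Sample Title One, Book 1", "325", "550"),
    ("Sample Title Two", "1,299", "Error! Not Found"),
]

def write_products(fileobj, rows):
    """Write a header line and the product rows to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["Product Name", "Current Price", "Original Price"])
    writer.writerows(rows)

# An in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO()
write_products(buffer, rows)
print(buffer.getvalue())
```

With a real file, the `csv` documentation recommends opening it with `newline=""`, for example `open(fileName, "w", newline="")`, and a `with` block closes the file even if a row fails to write.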