I am trying to scrape some data from a website. The page is HTML, and I want to extract the text "No description for 632930413867".
HTML code:
<div class="col-xs-6 col-sm-6 col-md-6 col-lg-6">
<table class="table product_info_table">
<tbody>
<tr>
<td>GS1 Address</td>
<td>R.R. 1, Box 2, Malmo, NE 68040</td>
</tr>
<tr>
<td>Description</td>
<td>
<div id="read_desc">
No description for 632930413867
</div>
</td>
</tr>
</tbody>
</table>
</div>
I also want the image src from this HTML:
<div class="centered_image header_image">
<img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" title="UPC 632930413867" alt="UPC 632930413867">
So I am using this code:
import time

import requests
from bs4 import BeautifulSoup as soup

Baseurl = "https://www.buycott.com/upc/632930413867"
uClient = ''
# Keep retrying until the request succeeds
while uClient == '':
    try:
        uClient = requests.get(Baseurl)
        print("Relax we are getting the data...")
    except requests.exceptions.ConnectionError:
        print("Connection refused by the server..")
        print("Let me sleep for 7 seconds")
        time.sleep(7)
        print("Was a nice sleep, now let me continue...")
        continue
page_html = uClient.content
uClient.close()
page_soup = soup(page_html, "html.parser")
Productcontainer = page_soup.find_all("div", {"class": "row"})
link = page_soup.find(itemprop="image")
print(Productcontainer)
for item in Productcontainer:
    print(link)
    productdescription = Productcontainer.find("div", {"class": "product_info_table"})
    print(productdescription)
When I run this code, no data is displayed. How can I get the description and the img src?
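Answer 0 (score: 2)
You just need to inspect the HTML and identify the tags that contain the data you want to scrape.
In this case, the image is at div.centered_image.header_image img, and the description is at div#read_desc.
An example using bs4 CSS selectors: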
import requests
from bs4 import BeautifulSoup
baseurl = "https://www.buycott.com/upc/632930413867"
page_html = requests.get(baseurl).content
soup = BeautifulSoup(page_html, "html.parser")
image = soup.select_one('div.centered_image.header_image img')['src']
description = soup.select_one('div#read_desc').text.strip()
print(image)
print(description)
Output:
https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
No description for 632930413867
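To see why the code in the question finds nothing, it may help to run the same lookups against the HTML fragments from the question, without touching the network. The sketch below is only an illustration: the HTML string is copied from the question (trimmed to the relevant rows, with a closing </div> added for the image block), and the variable names are my own. It assumes the live page has the same structure as those fragments.

```python
from bs4 import BeautifulSoup

# HTML fragments copied from the question, so no network access is needed.
HTML = """
<div class="centered_image header_image">
<img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" title="UPC 632930413867" alt="UPC 632930413867">
</div>
<div class="col-xs-6 col-sm-6 col-md-6 col-lg-6">
<table class="table product_info_table">
<tbody>
<tr>
<td>Description</td>
<td>
<div id="read_desc">
No description for 632930413867
</div>
</td>
</tr>
</tbody>
</table>
</div>
"""

page = BeautifulSoup(HTML, "html.parser")

# product_info_table is a class on the <table>, not on a <div>,
# so searching for a <div> with that class returns None.
wrong = page.find("div", {"class": "product_info_table"})
table = page.find("table", {"class": "product_info_table"})

# CSS selectors that target the elements directly.
image = page.select_one("div.centered_image.header_image img")["src"]
description = page.select_one("div#read_desc").get_text(strip=True)

print(wrong)
print(table is not None)
print(image)
print(description)
```

Note also that find_all() returns a list-like ResultSet, so .find() has to be called on each item inside the loop (item.find(...)) rather than on Productcontainer itself.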
Answer 1 (score: 0)
This can also be done like this:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://www.buycott.com/upc/632930413867").text, "lxml")
desc = soup.select("#read_desc")[0].text.strip()
link = soup.select(".centered_image img")[0]['src'].strip()
print("{}\n{}".format(desc,link))
Output:
No description for 632930413867
https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
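Both answers drop the retry loop from the question. If the server really does refuse connections from time to time, one way to keep that behaviour is a bounded retry with a timeout. This is a rough sketch rather than a definitive implementation: fetch_html is a hypothetical helper, and its parameters (attempts, delay, get, sleep) are names I made up for illustration. The get and sleep parameters exist only so the function can be exercised without real network calls or real waiting.

```python
import time

import requests


def fetch_html(url, attempts=3, delay=7.0, get=requests.get, sleep=time.sleep):
    """Return the page body as text, retrying on request errors (hypothetical helper)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return response.text
        except requests.exceptions.RequestException as error:
            last_error = error
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the last error instead of looping forever.
    raise last_error
```

The returned text can then be passed to BeautifulSoup exactly as in the answers above, for example BeautifulSoup(fetch_html("https://www.buycott.com/upc/632930413867"), "html.parser"). Unlike the loop in the question, this gives up after a fixed number of attempts and only catches errors raised by requests.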