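I'm using this website for some simple scraping practice, and I'm just trying to print every product on the page with find_all(). There are 12 products, each in a tbody with class product-variants-list, but I only get five, and I can't figure out what's wrong.
My code: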
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "lxml")

# Each product sits in its own <tbody class="product-variants-list">
product_list = soup.find_all("tbody", {"class": "product-variants-list"})
i = 0
for product in product_list:
    # The product name is the <b> text inside the <a class="follow3"> link
    product_name = product.find("a", {"class": "follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i += 1
The HTML is:
<table id="product-list" width="658" cellspacing="0" cellpadding="2" border="0">
<tbody class="products-header"></tbody>
<tbody class="product-variants-list">
<tr></tr>
<tr class="text" style="background-color:#ffffff;">
<td valign="middle" colspan="6">
<a class="follow3" title="Royal Canin Veterinary Diet - Hypoallergenic DR 21" href="/shop/dogs/dry_dog_food/royal_canin_vet_diet/307309">
<b>
Royal Canin Veterinary Diet - Hypoallergenic DR 21
</b>
::after
</a>
</td>
</tr>
<tr class="text" style="background-color:#ffffff;"></tr>
<tr class="text product-variant"></tr>
<tr class="text product-variant"></tr>
<tr class="text product-variant"></tr>
</tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="products-footer"></tbody>
</table>
My output:
0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
Answer 0 (score: 0)
I think the problem is this line:
soup = BeautifulSoup(html,"lxml")
If you replace "lxml" with "html.parser", it will work.
Here is the full code:
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
# Only change from the question: "html.parser" instead of "lxml"
soup = BeautifulSoup(html, "html.parser")

product_list = soup.find_all("tbody", {"class": "product-variants-list"})
i = 0
for product in product_list:
    product_name = product.find("a", {"class": "follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i += 1
The result is:
0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
5 Royal Canin Veterinary Diet - Urinary S/O LP 18
6 Royal Canin Veterinary Diet - Mobility MS 25
7 Royal Canin Veterinary Diet - Satiety Support SAT 30
8 Royal Canin Veterinary Diet - Hepatic HF 16
9 Royal Canin Veterinary Diet - Dental DLK 22
10 Royal Canin Veterinary Diet - Diabetic DS 37
11 Royal Canin Veterinary Diet - Calm CD 25
Hope it helps!
Have a nice day.
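The code above targets Python 2 (`urllib2` and the `print` statement). For anyone on Python 3, a rough equivalent might look like the sketch below. This is only an illustration: `extract_product_names` and `SAMPLE_HTML` are hypothetical names, the sample markup is a shortened stand-in for the real page, and the helper takes the HTML as a string so the parsing step can be tried without fetching anything. It also skips a block whose link or `<b>` is missing instead of raising `AttributeError`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (an assumption, not the actual markup).
SAMPLE_HTML = """
<table id="product-list">
  <tbody class="product-variants-list">
    <tr><td><a class="follow3"><b>
      Product A
    </b></a></td></tr>
  </tbody>
  <tbody class="product-adzone"></tbody>
  <tbody class="product-variants-list"><tr></tr></tbody>
</table>
"""


def extract_product_names(html, parser="html.parser"):
    """Return the product names found in product-variants-list blocks."""
    soup = BeautifulSoup(html, parser)
    names = []
    for product in soup.find_all("tbody", class_="product-variants-list"):
        link = product.find("a", class_="follow3")
        bold = link.find("b") if link is not None else None
        if bold is not None:
            # strip=True removes the whitespace around the name
            names.append(bold.get_text(strip=True))
    return names


if __name__ == "__main__":
    # To run against the live page instead (needs network access):
    #   from urllib.request import urlopen
    #   html = urlopen(url).read()
    for i, name in enumerate(extract_product_names(SAMPLE_HTML)):
        print(i, name)
```

Passing the parser name as an argument makes it easy to rerun the same extraction with "lxml" or "html5lib" and compare the results.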
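The differences between lxml and html.parser are explained well here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
If a document is not well-formed, different parsers will give different results. Here is a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored:
Result with lxml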
BeautifulSoup("<a></p>", "lxml")
<html><body><a></a></body></html>
Here is the same document parsed using html5lib:
Result with html5lib
BeautifulSoup("<a></p>", "html5lib")
<html><head></head><body><a><p></p></a></body></html>
Result with html.parser
BeautifulSoup("<a></p>", "html.parser")
<a></a>
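To check whether the parser is what is dropping elements on a particular page, one option is to run the same query under every parser that happens to be installed and compare the counts. The sketch below is a minimal illustration under stated assumptions: `count_matches` and `SAMPLE_HTML` are hypothetical names, and the sample markup is well-formed, so all parsers should agree on it. A disagreement on the real page would point at malformed HTML being repaired differently.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Small, well-formed stand-in for the real page (an assumption).
SAMPLE_HTML = """
<table id="product-list">
  <tbody class="products-header"></tbody>
  <tbody class="product-variants-list"><tr><td>A</td></tr></tbody>
  <tbody class="product-adzone"></tbody>
  <tbody class="product-variants-list"><tr><td>B</td></tr></tbody>
</table>
"""


def count_matches(html, parsers=("html.parser", "lxml", "html5lib")):
    """Count product-variants-list blocks as seen by each installed parser."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        counts[parser] = len(
            soup.find_all("tbody", class_="product-variants-list")
        )
    return counts


if __name__ == "__main__":
    for parser, n in count_matches(SAMPLE_HTML).items():
        print(parser, n)
```

html.parser ships with Python, so it always appears in the result; lxml and html5lib are listed only if they are installed.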