I can't get the complete data with BeautifulSoup4

Time: 2015-08-17 09:21:33

Tags: python-2.7 beautifulsoup

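I am using this website for simple practice, and I am just trying to print all the products on the page with find_all(). There are 12 products, each marked up as a tbody with the class product-variants-list, but I only get five of them and I can't find the problem.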

My code:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html,"lxml")

product_list = soup.find_all("tbody", {"class":"product-variants-list"})

i=0

for product in product_list:

    product_name = product.find("a",{"class":"follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i +=1

The HTML is:

<table id="product-list" width="658" cellspacing="0" cellpadding="2" border="0">

    <tbody class="products-header"></tbody>
    <tbody class="product-variants-list">
        <tr></tr>
        <tr class="text" style="background-color:#ffffff;">
            <td valign="middle" colspan="6">
                <a class="follow3" title="Royal Canin Veterinary Diet - Hypoallergenic DR 21" href="/shop/dogs/dry_dog_food/royal_canin_vet_diet/307309">
                    <b>

                        Royal Canin Veterinary Diet - Hypoallergenic DR 21

                    </b>
                    ::after
                </a>
            </td>
        </tr>
        <tr class="text" style="background-color:#ffffff;"></tr>
        <tr class="text product-variant"></tr>
        <tr class="text product-variant"></tr>
        <tr class="text product-variant"></tr>
    </tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="products-footer"></tbody>

</table>

My output:

0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34

1 Answer:

Answer 0 (score: 0)

I think the problem is in this line:

soup = BeautifulSoup(html,"lxml")

If you replace "lxml" with "html.parser", find_all() returns all 12 products.
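Before switching parsers, it can help to confirm that all the product blocks are actually present in the downloaded markup, so you know the parser is dropping them rather than the server. The following is only a rough sketch: the helper name count_tbody_with_class is hypothetical, the sample string is made up for illustration, and a regular expression is not a real HTML parser, so treat the number as a sanity check only.

```python
import re

# Matches an opening <tbody ...> tag and captures the value of its class attribute.
TBODY_CLASS = re.compile(r'<tbody\b[^>]*\bclass\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def count_tbody_with_class(html, class_name):
    """Count opening <tbody> tags in the raw markup that carry class_name."""
    return sum(
        1
        for match in TBODY_CLASS.finditer(html)
        if class_name in match.group(1).split()
    )


# Made-up sample; in the question, `html` would be the text returned by
# response.read() (decode it to text first if you are on Python 3).
sample = (
    '<table>'
    '<tbody class="products-header"></tbody>'
    '<tbody class="product-variants-list"></tbody>'
    '<tbody class="product-adzone"></tbody>'
    '<tbody class="product-variants-list"></tbody>'
    '</table>'
)
print(count_tbody_with_class(sample, "product-variants-list"))
```

If this count is higher than len(product_list), the missing products were lost during parsing, which points at the parser choice.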

Here is the complete code:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html,"html.parser")

product_list = soup.find_all("tbody", {"class":"product-variants-list"})

i=0

for product in product_list:

    product_name = product.find("a",{"class":"follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i +=1

The result is:

0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
5 Royal Canin Veterinary Diet - Urinary S/O LP 18
6 Royal Canin Veterinary Diet - Mobility MS 25
7 Royal Canin Veterinary Diet - Satiety Support SAT 30
8 Royal Canin Veterinary Diet - Hepatic HF 16
9 Royal Canin Veterinary Diet - Dental DLK 22
10 Royal Canin Veterinary Diet - Diabetic DS 37
11 Royal Canin Veterinary Diet - Calm CD 25

Hope it helps!

Have a nice day.

**Update**

The differences between lxml and html.parser are explained well here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

  

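If the document is not well-formed, different parsers give different results. Here is a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored: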

Result with lxml:

BeautifulSoup("<a></p>", "lxml")
<html><body><a></a></body></html>

Here is the same document parsed using html5lib:

Result with html5lib:

BeautifulSoup("<a></p>", "html5lib")
<html><head></head><body><a><p></p></a></body></html>

Result with html.parser:

BeautifulSoup("<a></p>", "html.parser")
<a></a>
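To see how the parsers installed on your own machine treat a given page, you could run the same find_all() query under each of them and compare the counts. This is a minimal sketch, not a definitive diagnosis: the function name count_product_blocks and the sample markup are made up for illustration, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_product_blocks(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of matching tbody elements.

    A parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        matches = soup.find_all("tbody", {"class": "product-variants-list"})
        counts[name] = len(matches)
    return counts


# Made-up, well-formed sample; pass the real page source instead to compare parsers.
sample = (
    '<table id="product-list">'
    '<tbody class="product-variants-list"><tr><td>A</td></tr></tbody>'
    '<tbody class="product-adzone"></tbody>'
    '<tbody class="product-variants-list"><tr><td>B</td></tr></tbody>'
    '</table>'
)
for name, count in sorted(count_product_blocks(sample).items()):
    print("%s: %s" % (name, count))
```

If the counts differ between parsers on the real page, the markup is probably malformed somewhere, and the parser that recovers the most blocks is the practical choice for that page.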