Using BeautifulSoup 4

Time: 2015-11-29 18:36:10

Tags: python beautifulsoup html-parsing missing-data

As a practice exercise, I am trying to parse this search results page from Amazon using the BeautifulSoup library.

Here is my code.

# Python 2 code: urlopen lives in urllib (urllib.request in Python 3)
from urllib import urlopen
from bs4 import BeautifulSoup


SourceURL = "http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android"
ResultsPage = urlopen(SourceURL)
Soup = BeautifulSoup(ResultsPage)


print "<SearchResults>"

for SearchResult in Soup.findAll('li', attrs={'class': 's-result-item celwidget'}):
    #Read Result Title
    Title = SearchResult.find("h2", {"class": "a-size-medium a-color-null s-inline s-access-title a-text-normal"})

    ResultTag = "\t<Result><![CDATA["
    if Title is not None:
        ResultTag += Title.text

    ResultTag += "]]></Result>"
    print ResultTag

print "</SearchResults>"

The output it prints is as follows:

<SearchResults>
    <Result><![CDATA[Micromax Bolt S301 (Black, No charger, No earphone inbox)]]></Result>
    <Result><![CDATA[Android Application Development (with Kitkat Support), Black Book]]></Result>
    <Result><![CDATA[ZTE Blade Buzz White V815W]]></Result>
    <Result><![CDATA[Android:  App Development & Programming Guide: Learn In A Day! (Android, Rails, Ruby Programming, App Development...]]></Result>
    <Result><![CDATA[]]></Result>
    <Result><![CDATA[Karbonn Titanium S21 (Grey)]]></Result>
    <Result><![CDATA[Head First Android Development]]></Result>
    <Result><![CDATA[Micromax Canvas A1 Android One (White, 8GB)]]></Result>
    <Result><![CDATA[Professional Android 4 Application Development (Wrox)]]></Result>
    <Result><![CDATA[OnePlus X (Onyx) - Invite Only]]></Result>
    <Result><![CDATA[Lenovo Vibe S1 (4G, White)]]></Result>
    <Result><![CDATA[Micromax Bolt D320 (Black, 4GB)]]></Result>
    <Result><![CDATA[2 in 1 Capacitive Stylus Pen With Black Ball Pen for Android Touch Sceen Mobile Phones and Tablets All iPads and...]]></Result>
    <Result><![CDATA[Moto E 2nd Generation XT1506 (3G, Black)]]></Result>
    <Result><![CDATA[Android: App Development & Programming Guide: Learn In A Day!]]></Result>
    <Result><![CDATA[Lenovo Vibe S1 (4G, Dark Blue)]]></Result>
</SearchResults>

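As you can see, the fifth result is missing from the output for some reason, while the same code prints every other row. In effect, the SearchResult.find() method returns None for just that one record.

Could you tell me if I am missing something?

Thanks, NIKHIL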

1 Answer:

Answer 0: (score: 0)

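If you look at the link http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android, the 5th li element does satisfy your class name condition s-result-item celwidget, but it is actually the "Customers shopped for android in" block rather than a product. It does not exactly match your second condition, a-size-medium a-color-null s-inline s-access-title a-text-normal, so SearchResult.find() returns None and Title is set to None.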
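The behaviour can be reproduced without hitting the live site. The following is only a minimal sketch: the HTML fragment is a hypothetical, heavily simplified stand-in for the Amazon page (the real markup is not shown in this thread), and the names HTML, TITLE_CLASS, items and titles are made up for illustration. It assumes bs4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: NOT the real Amazon page.
# Both <li> elements carry the same class string, but only the first
# one contains an <h2> with the exact title class string.
HTML = """
<ul>
  <li class="s-result-item celwidget">
    <h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Head First Android Development</h2>
  </li>
  <li class="s-result-item celwidget">
    <h2 class="a-size-base a-text-normal">Customers shopped for android in</h2>
  </li>
</ul>
"""

TITLE_CLASS = "a-size-medium a-color-null s-inline s-access-title a-text-normal"

soup = BeautifulSoup(HTML, "html.parser")
items = soup.find_all("li", attrs={"class": "s-result-item celwidget"})

# One entry per <li>: the title text, or None when find() matched nothing.
titles = []
for item in items:
    heading = item.find("h2", {"class": TITLE_CLASS})
    titles.append(heading.text if heading is not None else None)

print(titles)
```

Both li elements pass the outer filter, but find() only returns a tag for the first one, which is the same pattern that produces the empty CDATA row in the question's output.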

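You can change the condition as shown below so that only results with a title are printed, which gives the desired output.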

if Title is not None:
    ResultTag = "\t<Result><![CDATA["
    ResultTag += Title.text
    ResultTag += "]]></Result>"
    print ResultTag
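Put together, the loop with this check might look like the sketch below. It is an illustration under stated assumptions, not a drop-in replacement: it uses Python 3 compatible print() calls (the code in this thread is Python 2), it parses a hypothetical inline sample instead of downloading the page, and build_results_xml, TITLE_CLASS and SAMPLE are names invented here.

```python
from bs4 import BeautifulSoup

TITLE_CLASS = "a-size-medium a-color-null s-inline s-access-title a-text-normal"


def build_results_xml(html):
    """Return the <SearchResults> text, skipping items that have no title."""
    soup = BeautifulSoup(html, "html.parser")
    lines = ["<SearchResults>"]
    for item in soup.find_all("li", attrs={"class": "s-result-item celwidget"}):
        title = item.find("h2", {"class": TITLE_CLASS})
        if title is None:
            # Not a product entry (for example a "Customers shopped for" block).
            continue
        lines.append("\t<Result><![CDATA[" + title.text + "]]></Result>")
    lines.append("</SearchResults>")
    return "\n".join(lines)


# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<ul>
  <li class="s-result-item celwidget">
    <h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Karbonn Titanium S21 (Grey)</h2>
  </li>
  <li class="s-result-item celwidget">
    <h2 class="a-size-base">Customers shopped for android in</h2>
  </li>
</ul>
"""

output = build_results_xml(SAMPLE)
print(output)
```

Using continue keeps the skip decision in one place, so the opening and closing Result markup are only ever written for items that actually have a title.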