As a practice exercise, I am trying to parse this search results page from Amazon using the BeautifulSoup library.
Here is my code.
from urllib import urlopen
from bs4 import BeautifulSoup

SourceURL = "http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android"
ResultsPage = urlopen(SourceURL)
Soup = BeautifulSoup(ResultsPage)
print "<SearchResults>"
for SearchResult in Soup.findAll('li', attrs={'class': 's-result-item celwidget'}):
    #Read Result Title
    Title = SearchResult.find("h2", {"class": "a-size-medium a-color-null s-inline s-access-title a-text-normal"})
    ResultTag = "\t<Result><![CDATA["
    if Title is not None:
        ResultTag += Title.text
    ResultTag += "]]></Result>"
    print ResultTag
print "</SearchResults>"
The output is shown below.
<SearchResults>
<Result><![CDATA[Micromax Bolt S301 (Black, No charger, No earphone inbox)]]></Result>
<Result><![CDATA[Android Application Development (with Kitkat Support), Black Book]]></Result>
<Result><![CDATA[ZTE Blade Buzz White V815W]]></Result>
<Result><![CDATA[Android: App Development & Programming Guide: Learn In A Day! (Android, Rails, Ruby Programming, App Development...]]></Result>
<Result><![CDATA[]]></Result>
<Result><![CDATA[Karbonn Titanium S21 (Grey)]]></Result>
<Result><![CDATA[Head First Android Development]]></Result>
<Result><![CDATA[Micromax Canvas A1 Android One (White, 8GB)]]></Result>
<Result><![CDATA[Professional Android 4 Application Development (Wrox)]]></Result>
<Result><![CDATA[OnePlus X (Onyx) - Invite Only]]></Result>
<Result><![CDATA[Lenovo Vibe S1 (4G, White)]]></Result>
<Result><![CDATA[Micromax Bolt D320 (Black, 4GB)]]></Result>
<Result><![CDATA[2 in 1 Capacitive Stylus Pen With Black Ball Pen for Android Touch Sceen Mobile Phones and Tablets All iPads and...]]></Result>
<Result><![CDATA[Moto E 2nd Generation XT1506 (3G, Black)]]></Result>
<Result><![CDATA[Android: App Development & Programming Guide: Learn In A Day!]]></Result>
<Result><![CDATA[Lenovo Vibe S1 (4G, Dark Blue)]]></Result>
</SearchResults>
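As you can see, the title of the fifth result is missing from the output (an empty CDATA section is printed), while the same code prints every other row correctly. Essentially, SearchResult.find() returns None for just that one record.
Could you tell me what I am missing?
Thanks, NIKHIL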
Answer 0 (score: 0)
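If you look at the page at http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android, the fifth li element does match your class name condition, 's-result-item celwidget', but it is actually the "Customers shopped for android in" block rather than a product. It does not exactly match your second condition, 'a-size-medium a-color-null s-inline s-access-title a-text-normal', so Title is set to None.
You can update the condition as shown below to print the desired output.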
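The mismatch can be reproduced offline with a minimal sketch. The markup below is hypothetical and heavily simplified (the real Amazon page is much larger, and the class on the second heading is only an assumption); it shows two li elements that both match the outer class, of which only one contains an h2 with the expected title class. The sketch uses Python 3 and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Class string copied from the question's code.
TITLE_CLASS = "a-size-medium a-color-null s-inline s-access-title a-text-normal"

# Hypothetical, simplified markup standing in for the real results page.
HTML = """
<ul>
  <li class="s-result-item celwidget">
    <h2 class="%s">Head First Android Development</h2>
  </li>
  <li class="s-result-item celwidget">
    <h2 class="a-size-base">Customers shopped for android in</h2>
  </li>
</ul>
""" % TITLE_CLASS

soup = BeautifulSoup(HTML, "html.parser")

# Both list items match the outer class condition...
items = soup.find_all("li", attrs={"class": "s-result-item celwidget"})

# ...but find() returns None where no h2 carries the exact title class string.
titles = [item.find("h2", {"class": TITLE_CLASS}) for item in items]

for title in titles:
    print(title.text if title is not None else None)
```

The second item yields None from find(), which is why the original loop prints an empty CDATA section for it.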
if Title is not None:
    ResultTag = "\t<Result><![CDATA["
    ResultTag += Title.text
    ResultTag += "]]></Result>"
    print ResultTag
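For readers on Python 3, the same fix might look roughly like the sketch below. It is not the original poster's code: the function name build_results_xml and the sample markup are made up for illustration, and the HTML is passed in as a string so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Class string copied from the question's code.
TITLE_CLASS = "a-size-medium a-color-null s-inline s-access-title a-text-normal"


def build_results_xml(html):
    """Return the <SearchResults> block, skipping items that have no title heading."""
    soup = BeautifulSoup(html, "html.parser")
    lines = ["<SearchResults>"]
    for item in soup.find_all("li", attrs={"class": "s-result-item celwidget"}):
        title = item.find("h2", {"class": TITLE_CLASS})
        if title is None:
            # Not a product entry (e.g. a "Customers shopped for ..." block).
            continue
        lines.append("\t<Result><![CDATA[" + title.text + "]]></Result>")
    lines.append("</SearchResults>")
    return "\n".join(lines)


# Hypothetical sample: one product item and one non-product item.
SAMPLE_HTML = (
    "<ul>"
    '<li class="s-result-item celwidget">'
    '<h2 class="' + TITLE_CLASS + '">Karbonn Titanium S21 (Grey)</h2></li>'
    '<li class="s-result-item celwidget">'
    "<span>Customers shopped for android in</span></li>"
    "</ul>"
)

xml = build_results_xml(SAMPLE_HTML)
print(xml)
```

Building the whole Result line only inside the None check means non-product items produce no output at all, rather than an empty CDATA section.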