I have code that scrapes a web page, and it returns multiple instances of the following:
<div class="post"><a title="Brass-plated door knob" href="http:URL-LINK">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT" />
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob</strong></a>
<div class="desc"><p>Brass-plated door knob</p></div></div>
I want to get the href link and the corresponding price from each one and sort them, with the ideal output being:
HIGHEST PRICE, URL-LINK
...
LOWEST PRICE, URL-LINK
I can get the prices (although they come with the word "dollars", which I could do without) using:
price = soup.find_all("em", class_="fl")
But I am not sure how to get the corresponding href links and then sort and list all of them.
Right now I iterate over the output like this:
if len(price) < 100:
    # print every match (the original range(1, ...) skipped the first one)
    for item in price:
        print(item)
else:
    print(len(price))
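Any ideas?
Answer 0 (score: 1)
The idea is to iterate over all the posts and get the link and the price for each one.
A working example based on your input: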
from bs4 import BeautifulSoup
data = """
<div>
<div class="post">
<a title="Brass-plated door knob" href="http:URL-LINK">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob</strong>
</a>
<div class="desc"><p>Brass-plated door knob</p></div>
</div>
<div class="post">
<a title="Brass-plated door knob2" href="http:URL-LINK2">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
<span class="det"><em class="fl">410.25 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob2</strong>
</a>
<div class="desc"><p>Brass-plated door knob2</p></div>
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
result = []
for post in soup.select('div.post'):
    link = post.a.get('href')
    # the price text looks like "3.87 dollars"; keep only the number
    price = float(post.find('em', class_='fl').text.split(' ')[0])
    result.append({'link': link, 'price': price})
print(result)
Prints (formatted for readability; key order may vary with the Python version):
[
{'price': 3.87, 'link': 'http:URL-LINK'},
{'price': 410.25, 'link': 'http:URL-LINK2'}
]
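To get the ordering the question asks for (highest price first), a list shaped like `result` can be sorted on its `price` key. This is a minimal sketch, assuming Python 3; the sample values are copied from the output above, and `ranked` is just an illustrative name:

```python
from operator import itemgetter

# Sample data in the same shape as the `result` list printed above.
result = [
    {'price': 3.87, 'link': 'http:URL-LINK'},
    {'price': 410.25, 'link': 'http:URL-LINK2'},
]

# Sort by price, highest first; sorted() returns a new list
# and leaves `result` untouched.
ranked = sorted(result, key=itemgetter('price'), reverse=True)

# One "PRICE, URL" line per post.
for item in ranked:
    print('{:.2f}, {}'.format(item['price'], item['link']))
```

Because the prices were already converted to `float`, the comparison is numeric rather than alphabetical, so 410.25 sorts above 3.87.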
Answer 1 (score: 0)
From your HTML, you can get the link that corresponds to each price:
prices = soup.find_all("em", class_="fl")
for price in prices:
    # walk up to the enclosing <a> to read its href
    print("{} {}".format(price.find_parent('a').get('href'), price.text.split()[0]))
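Sorting while scraping is inconvenient. You can store the prices and links in a dictionary instead. Convert the prices to floats, as in alecxe's answer, and sort them once scraping is done.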
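The dictionary approach might look roughly like this. It is a sketch under a few assumptions: Python 3, and each price text starting with a number followed by a space. The `scraped` pairs are copied from the sample HTML, and `parse_price` and `prices_by_link` are hypothetical names introduced only for illustration:

```python
# (href, raw price text) pairs, with values copied from the sample HTML.
scraped = [
    ('http:URL-LINK', '3.87 dollars'),
    ('http:URL-LINK2', '410.25 dollars'),
]


def parse_price(text):
    """Hypothetical helper: drop the trailing word and return a float."""
    return float(text.split()[0])


# A dict keeps one price per link, so a repeated link
# would overwrite the earlier entry.
prices_by_link = {link: parse_price(text) for link, text in scraped}

# Sort once scraping is done, highest price first.
ranked = sorted(prices_by_link.items(), key=lambda pair: pair[1], reverse=True)
for link, price in ranked:
    print('{:.2f}, {}'.format(price, link))
```

If the same link can appear with different prices, a list of tuples may be a better fit than a dictionary, since it keeps every pair.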