BeautifulSoup: searching for a tag by its text content

Time: 2013-05-08 06:28:38

Tags: python beautifulsoup web-crawler

Given the following HTML:

<div class="rating-list">
<ul class="recommend">
<li>
<span class="recommend-titleInline">Stayed April 2013, traveled as a couple</span>
<ul class="recommend-column first">
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Value</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Location</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Sleep Quality</li>
</ul>
<ul class="recommend-column">
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Rooms</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Cleanliness</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Service</li>
</ul>
</li>
</ul>
</div>

I already use BeautifulSoup to get the whole block, and now I want to pick out the individual "li" tags like this:

valueRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Value')
locationRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Location')
sleepRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Sleep Quality')
roomRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Rooms')
cleanRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Cleanliness')
serviceRatingTag = subRatingListTags[i].find(name='li', attrs={'class': 'recommend-answer'}, text='Service')

But this seems to fail. All six variables are None, which is not what I expected. What should I do?

2 Answers:

Answer 0 (score: 0)

Does passing a regular expression as the text argument help?

import re
subRatingListTags[i].find(text=re.compile("Location"))

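The newlines around the label are probably what makes the exact text match fail. Note that, called without a tag name, find returns the matching text node rather than the "li" itself; the tag is available through the node's parent.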
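A minimal sketch of the regular-expression approach, assuming BeautifulSoup 4 (the `bs4` package) with the built-in `html.parser`. The helper name `find_rating_li` and the shortened HTML sample are illustrative and not from the original post. The `string=` keyword is the newer spelling of the `text=` argument used above.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the six items).
html = """
<ul class="recommend-column first">
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Value</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Location</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def find_rating_li(root, label):
    """Hypothetical helper: return the <li> whose text node is `label`."""
    # Match the label while tolerating surrounding whitespace and newlines.
    pattern = re.compile(r"^\s*" + re.escape(label) + r"\s*$")
    text_node = root.find(string=pattern)  # older bs4 versions: text=pattern
    if text_node is None:
        return None
    # The match is a text node, so walk up to the enclosing <li>.
    return text_node.find_parent("li", class_="recommend-answer")


location_li = find_rating_li(soup, "Location")
location_alt = location_li.img["alt"]
missing_li = find_rating_li(soup, "Rooms")  # not in this shortened sample
```

Anchoring the pattern with `^\s*` and `\s*$` keeps "Location" from also matching a longer label that merely contains the word.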

Answer 1 (score: 0)

It is not clear what you want. Anyway:

>>> lis = [t for t in soup.find_all('li', 'recommend-answer')]
>>> lis[0].text
'\n\n\n\nValue'
>>> lis[1].text
'\n\n\n\nLocation'
>>> lis[0].img['alt']
'5 of 5 stars'

You will definitely want to preprocess the HTML to remove all the newlines before you start parsing.
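As an alternative to rewriting the raw HTML, the whitespace can be stripped at lookup time instead. The sketch below is not the approach from this answer; it assumes BeautifulSoup 4 with `html.parser`, uses a shortened copy of the markup, and the `ratings` dictionary is an illustrative name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (three of the six items).
html = """
<ul class="recommend-column first">
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Value</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Location</li>
<li class="recommend-answer">
<span class="rate rate_ss ss50">
<img class="sprite-ratings" src="http://c1.tacdn.com/img2/x.gif" alt="5 of 5 stars" content="5.0"/>
</span>
Sleep Quality</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each label ("Value", "Location", ...) to its numeric rating.
ratings = {}
for li in soup.find_all("li", class_="recommend-answer"):
    # get_text(strip=True) drops the newlines that surround the label.
    label = li.get_text(strip=True)
    # The rating is stored in the <img> tag's "content" attribute.
    ratings[label] = float(li.img["content"])

value_rating = ratings.get("Value")
```

Collecting everything in one pass avoids six separate `find` calls, and a missing category simply shows up as an absent key.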