Extracting 2 pieces of information from HTML in Python

Time: 2017-10-11 05:46:02

Tags: python html html-table beautifulsoup lxml

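I need help figuring out how to extract "Grab" and the numbers that follow data-b. The full, unmodified page contains many <tr> elements, and I need to pick out the right one using the "Need" that appears just before </a>. I have been trying to do this with Beautiful Soup, although it looks like lxml might be a better fit. I can get either all of the <tr> elements or only the <a>...</a> lines that contain "Need", but not just the <tr> rows whose <a> line contains "Need".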

<tr >
     <td>3</td>
     <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
     <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
     <td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
     <td class="text-right">7.38%</td>
     <td class="text-right " >Recently</td>
</tr>

<tr >
     <td>4</td>
     <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
     <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
     <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
     <td class="text-right">Some more</td>
     <td class="text-right " >Recently</td>
</tr>

Thanks for your help!

1 answer:

Answer 0 (score: 0)

from bs4 import BeautifulSoup


data = '''<tr>
 <td>3</td>
 <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
 <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
 <td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
 <td class="text-right">7.38%</td>
 <td class="text-right " >Recently</td>
</tr>

<tr>
 <td>4</td>
 <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
 <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
 <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
 <td class="text-right">Some more</td>
 <td class="text-right " >Recently</td>
</tr>
'''

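# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')
# Text of the first link whose href is exactly "/local"
print(soup.find_all('a', {"href": "/local"})[0].text)
# data-b of every span whose class is "bloat" or "bloat2"
for span in soup.find_all('span', {"class": ["bloat", "bloat2"]}):
    print(span['data-b'])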
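
The answer's code depends on the specific href and class values in the sample, so it does not select rows by the "Need" link text that the question asks about. A minimal sketch of that row filter, assuming each target row has a link whose text is exactly "Need" and that the first link in the row holds the name, might look like the following. The function name extract_rows and its marker parameter are hypothetical, and the sample is a trimmed copy of the rows from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two sample rows from the question.
html = '''<tr>
 <td>3</td>
 <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
 <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
</tr>
<tr>
 <td>4</td>
 <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
 <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
 <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
</tr>
'''


def extract_rows(html, marker="Need"):
    """Return (name, data-b values) for each row containing a link with text == marker.

    Hypothetical helper: assumes the first <a> in a matching row holds the name.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        links = row.find_all("a")
        # Skip rows that have no link whose text is exactly the marker.
        if not any(a.get_text(strip=True) == marker for a in links):
            continue
        name = links[0].get_text(strip=True)
        # attrs={"data-b": True} matches any span that has a data-b attribute.
        values = [s["data-b"] for s in row.find_all("span", attrs={"data-b": True})]
        results.append((name, values))
    return results


print(extract_rows(html))
```

Filtering at the row level first keeps the name and the data-b values tied to the same <tr>, which avoids picking up spans from unrelated rows. The values come back as strings; convert them with float() if numbers are needed.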