Question

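I'm trying out BS4 and want to print the exact TD tag AUD/AED from the example below. I know I can use indexing such as [-1] to always get the last one, but in some other data the TD tag I want will be in the middle. Is there a way to target the AUD/AED tag specifically?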

Example:

<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>

The code I'm using:

soup = BeautifulSoup(r)
table = soup.find(attrs={"class": "RESULTS"})
print(table)
days = table.find_all('tr')

From here I can get the last TR tag, but I need to find the TR tag whose TD tag is AUD/AED.

I'm looking for something like:

if td[2] == <td>AUD/AED</td>:
    print(tr[-1])

Answer 1

This kind of thing is cleaner when you have a CSS selector to work with, but it doesn't look like we can do that here.

The next best option is to find the tag you want explicitly:

soup.find(class_='RESULTS').find(text='AUD/AED')

Then use the bs4 API to navigate from there.

tr = soup.find(class_='RESULTS').find(text='AUD/AED').parent.parent

import re

tr.find(text=re.compile(r'\w+ \d{1,2} \w+ \d{4}'))
Out[66]: 'Wednesday 23 APR 2014'

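This approach doesn't assume anything about the layout of the tr's children; it just looks for a sibling of the AUD/AED tag that looks like a date (according to the regex).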
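For reference, here is a minimal self-contained sketch of the same idea, run against the table from the question. It assumes the built-in "html.parser" backend and uses the `string=` keyword, which is the newer name for `text=` in Beautiful Soup 4.4 and later; the variable names (`html`, `cell`, `row`, `spot_date`) are just illustrative.

```python
import re
from bs4 import BeautifulSoup

# The sample table from the question, inlined so the snippet runs on its own.
html = """
<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <td> whose text is exactly "AUD/AED" inside the RESULTS table.
cell = soup.find(class_="RESULTS").find("td", string="AUD/AED")

# Walk up to the enclosing row instead of chaining .parent.parent.
row = cell.find_parent("tr")

# Within that row, pick the text node that looks like a date.
spot_date = row.find(string=re.compile(r"\w+ \d{1,2} \w+ \d{4}"))

print(row)
print(spot_date)  # Wednesday 23 APR 2014
```

Using `find_parent("tr")` rather than a fixed number of `.parent` hops keeps this working if the cell content is ever wrapped in another tag.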

Answer 2

Something like this? Assuming soup is your table.

cellIndex = 0
desiredIndex = None
cells = soup.find_all('td')
while cellIndex < len(cells):
    if cells[cellIndex].text == u'AUD/AED':
        # the cell right after the match holds the spot date
        desiredIndex = cellIndex + 1
        break
    cellIndex += 1
if desiredIndex is not None and desiredIndex < len(cells):
    # desiredIndex was found
    print(cells[desiredIndex].text)
else:
    print("cell not found")
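The same idea can also be written row by row, which is closer to the pseudocode in the question (check the third TD, then print the whole TR). This is only a sketch: `find_row` is a hypothetical helper, the column index 2 assumes the Instrument column is always third, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# The sample table from the question, inlined so the snippet runs on its own.
html = """
<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>
"""


def find_row(table, instrument, column=2):
    """Return the first <tr> whose <td> at `column` equals `instrument`, else None."""
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")  # header rows only have <th>, so this is empty there
        if len(cells) > column and cells[column].get_text(strip=True) == instrument:
            return tr
    return None


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="RESULTS")

row = find_row(table, "AUD/AED")
if row is not None:
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    print(values)
else:
    values = []
    print("row not found")
```

Because the match is made per row, it does not matter whether the wanted row is last or somewhere in the middle of the table.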

Answer 3

I would probably use lxml and XPath:

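from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
from lxml import etree

# etree.parse needs an HTML string; str() converts a BeautifulSoup tag if table is one
tree = etree.parse(StringIO(str(table)), etree.HTMLParser())
d = tree.xpath("//table[@class='RESULTS']/tr[./td[3][text()='AUD/AED']]/td[4]/text()")[0]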

The variable d should contain the string "Wednesday 23 APR 2014".

If you really want BeautifulSoup, you can mix lxml and BeautifulSoup without any problem.
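One possible way to mix the two, sketched under a few assumptions: BeautifulSoup (with the built-in "html.parser") locates the table, the tag is serialized back to a string with `str()`, and lxml runs the XPath on that string. The names `html`, `table`, and `dates` are illustrative only.

```python
from io import StringIO

from bs4 import BeautifulSoup
from lxml import etree

# The sample table from the question, inlined so the snippet runs on its own.
html = """
<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>
"""

# Step 1: use BeautifulSoup to pick out the table of interest.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="RESULTS")

# Step 2: hand the serialized table to lxml and query it with XPath.
tree = etree.parse(StringIO(str(table)), etree.HTMLParser())
dates = tree.xpath(
    "//table[@class='RESULTS']//tr[td[3][text()='AUD/AED']]/td[4]/text()"
)

# xpath() returns a list; it is empty when no row matches.
print(dates[0] if dates else "row not found")
```

Checking the list before indexing avoids an IndexError when the instrument is not present in the table.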

Python Beautiful Soup: printing an exact TD tag
