Question

I want to scrape some data from an HTML page that looks like this:

<tr>
 <td> Some information </td>
 <td> 123 </td>
</tr>
<tr>
 <td> some other information </td>
 <td> 456 </td>
</tr>
<tr>
 <td> and the info continues </td>
 <td> 789 </td>
</tr>

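What I want is to get the HTML row that follows a given HTML row. That is, if I see "some other information", I need to output "456". I thought about combining a regex with BeautifulSoup's .find_next, but I haven't had any luck with it (I'm also not very familiar with regex). Does anyone know how to do this? Thanks a lot in advance.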

Answer 1

In fact, you can get what you want by combining a regex with find_next in BeautifulSoup:

from bs4 import BeautifulSoup
import re

html = """
<tr>
 <td> Some information </td>
 <td> 123 </td>
</tr>
<tr>
 <td> some other information </td>
 <td> 456 </td>
</tr>
<tr>
 <td> and the info continues </td>
 <td> 789 </td>
</tr>
"""

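# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument
x = soup.find('td', string=re.compile('some other information'))
# .text keeps the surrounding whitespace, so strip it before printing
print(x.find_next('td').text.strip())

Output

456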

EDIT: replaced x.find_next('td').contents[0] with the shorter x.find_next('td').text
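
If you need this lookup in more than one place, it can be wrapped in a small helper. The following is only a sketch: the name value_after is made up for illustration, and it assumes the label and its value always sit in adjacent td cells of the same row. It uses find_next_sibling rather than find_next, so a label in the last cell of a row returns None instead of picking up a cell from the following row.

```python
import re
from bs4 import BeautifulSoup


def value_after(html, label):
    """Return the stripped text of the <td> that follows the <td> containing `label`.

    Hypothetical helper, not part of BeautifulSoup. Returns None when no
    cell matches or when the matching cell is the last one in its row.
    """
    soup = BeautifulSoup(html, "html.parser")
    # re.escape keeps characters such as "." or "(" in the label literal.
    cell = soup.find("td", string=re.compile(re.escape(label)))
    if cell is None:
        return None
    # find_next_sibling stays inside the same <tr>; find_next would not.
    sibling = cell.find_next_sibling("td")
    return sibling.get_text(strip=True) if sibling is not None else None


rows = """
<tr><td> some other information </td><td> 456 </td></tr>
<tr><td> and the info continues </td><td> 789 </td></tr>
"""

print(value_after(rows, "some other information"))  # 456
print(value_after(rows, "not on the page"))         # None
```

Checking for None before calling find_next_sibling avoids the AttributeError you would otherwise get when the label is not on the page.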

How to scrape a specific HTML row that comes after another HTML row