Question

I need to extract the data between the closing </b> tag and the <br> tag in the snippet below:

<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>

What I need is: W, 65, 3

The problem is that these values can also be empty, for example:

<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>

I want to get the values if they exist, and an empty string otherwise.

I tried using nextSibling and find_next('br'), but it returned

 <br><b>Second Type :</b><br><b>Third Type :</b></br></br>

and

<br><b>Third Type :</b></br>

whenever there is no value (W, 65, 3) between </b> and <br>.

What I need is: if there is nothing between these tags, it should return an empty string.

Answer 1

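I would go through the <b> tags one by one and look at what kind of information each tag's next_sibling contains.

I just check whether next_sibling.string is None and append to the list accordingly :)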

>>> from bs4 import BeautifulSoup
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
...     sibling = tag.next_sibling
...     # A <br> tag has no .string; the last <b> may have no sibling at all.
...     if sibling is None or sibling.string is None:
...         data.append(" ")
...     else:
...         data.append(sibling.string)
...
>>> data
[' ', '65', '3']  # the first value was removed from the HTML
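Since the question asks for an empty string rather than a space, the same next_sibling idea can be wrapped in a small helper. This is only a sketch: the function name values_after_b is made up for illustration, and it assumes html.parser, where <br> is a childless tag.

```python
from bs4 import BeautifulSoup, NavigableString


def values_after_b(html):
    """Return the text directly after each <b> tag, or "" if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for tag in soup.find_all("b"):
        sibling = tag.next_sibling
        # Only a text node counts as a value. A <br> tag, or no sibling at
        # all (the last <b> in the cell), means the value is empty.
        if isinstance(sibling, NavigableString):
            values.append(str(sibling).strip())
        else:
            values.append("")
    return values


filled = "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
empty = "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"

print(values_after_b(filled))
print(values_after_b(empty))
```

Checking the sibling's type instead of its .string also covers the fully empty cell from the question, where the last <b> has no sibling.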

Hope this helps!

Answer 2

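I would search for the td objects and then use a regex pattern to filter out the data you need, instead of using re.compile inside the find_all method.

Like this: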

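import re
from bs4 import BeautifulSoup

example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""

soup = BeautifulSoup(example, "html.parser")

for o in soup.find_all('td'):
    # Each match is a (value, terminator) tuple. </td is accepted as a
    # terminator because html.parser renders <br/> without a closing </br>.
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br|</td)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))

This pattern finds all the text between a </b> and the next <br, </br, or </td. Depending on the parser, closing </br> tags may be added when the soup object is converted to a string; html.parser does not add them, which is why </td is matched as well.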

This example outputs:

W,65,3

,69,6

Since this is only an example, you can adapt it to return an empty string whenever one of the regex matches is empty.
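If the way <br> is rendered is a concern, a simpler pattern that stops at the next "<" of any kind may be enough. The following is a sketch rather than part of the answer above; the names VALUE_AFTER_B and values_by_regex are invented for illustration, and it assumes the values themselves never contain a tag.

```python
import re
from bs4 import BeautifulSoup

# Capture everything after a closing </b> up to the next "<", whatever tag follows.
VALUE_AFTER_B = re.compile(r"</b>([^<]*)")


def values_by_regex(html):
    """Return one list of values per <td>; missing values become ""."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td"):
        rows.append([value.strip() for value in VALUE_AFTER_B.findall(str(td))])
    return rows


example = (
    "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
    "<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"
)

for row in values_by_regex(example):
    print(",".join(row))
```

Because [^<]* can match zero characters, an empty value still produces a match, so every row keeps one entry per <b> tag.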

Answer 3

In [5]: [child for child in soup.td.children if isinstance(child, str)]
Out[5]: ['W', '65', '3']

These strings and tags are children of td; you can access them through contents (a list) or children (a generator).

In [4]: soup.td.contents
Out[4]: 
[<b>First Type :</b>,
 'W',
 <br/>,
 <b>Second Type :</b>,
 '65',
 <br/>,
 <b>Third Type :</b>,
 '3']

Then you can pick out the text by testing whether each child is an instance of str.
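One caveat: filtering on str alone drops empty values, so when a value is missing you can no longer tell which label it belonged to. A possible way around this is to walk the children and open a new slot at every <b>. This is a sketch; values_from_children is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def values_from_children(td):
    """Walk the children of a <td>, keeping one slot per <b> label."""
    values = []
    for child in td.children:
        if isinstance(child, Tag) and child.name == "b":
            # A new label starts: assume its value is empty until text follows.
            values.append("")
        elif isinstance(child, NavigableString) and values:
            # Text after a label fills the most recently opened slot.
            values[-1] += child.strip()
    return values


html = "<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
soup = BeautifulSoup(html, "html.parser")

print(values_from_children(soup.td))
print([child for child in soup.td.children if isinstance(child, str)])
```

The first print keeps an empty first slot, while the plain str filter returns only the two values that are present.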

Answer 4

I think this works:

Because the values you want always come right after the end of a tag, it is easy to capture them this way, with no need for re.

How do I get the value between two different tags using Beautiful Soup?
