Question

I can remove the superscripts from the following page:

https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm

The approach comes from this post: Beautiful soup remove superscripts

But now I have a superscript that is not marked with a sup tag:

https://www.sec.gov/Archives/edgar/data/1549802/000110465918031489/a18-13128_1ex99d1.htm

After "Net revenues" there is a superscript 1 that is not wrapped in a sup tag.

How can I remove this superscript by adapting the approach from that post (Beautiful soup remove superscripts)?

Answer 1

It looks like the element has the following format:

<font size="1" style="font-size:6.5pt;font-weight:bold;position:relative;top:-3.0pt;">1</font>

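So we can see that the page formats this text with a font tag, and the important parts of its style are position:relative and the top: value. Personally, I would write a function that can be extended to detect superscripts and remove them. For example: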

def Remove_Superscripts(soup):
    # Simple case: remove every real <sup> tag.
    for element in soup.find_all('sup'):
        element.extract()

    # Harder case from this example: <font> tags that fake a superscript
    # with an inline style containing position:relative and top:.
    for element in soup.find_all(lambda e: e and e.name == 'font' and e.has_attr('style') and
                                           'position:relative' in e['style'] and
                                           'top:' in e['style']):
        element.extract()

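This is a very lazy and messy example, but it should give you an idea of how to remove superscripts that are not marked with a <sup> tag. Unfortunately, you will need to extend and modify this function each time you run into a new case where someone builds a superscript differently (I would try to keep it as open and generic as possible).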
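One way to make such a function easier to extend is to parse the inline style instead of matching raw substrings, so that spacing and letter case in the style attribute stop mattering. The following is only a sketch under assumptions: the helper names parse_inline_style, looks_like_superscript and remove_styled_superscripts are hypothetical, the "negative top means shifted upward" rule is a heuristic taken from the single font tag quoted above, and the vertical-align: super branch is a guess at what other pages might use.

```python
def parse_inline_style(style):
    """Split an inline CSS string into a {property: value} dict, lowercased."""
    declarations = {}
    for part in style.split(";"):
        if ":" not in part:
            continue  # skip empty pieces such as the one after a trailing ';'
        prop, _, value = part.partition(":")
        declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def looks_like_superscript(style):
    """Heuristic (an assumption, not a standard): treat an element as a
    superscript if it is relatively positioned with a negative top offset,
    or if it uses vertical-align: super."""
    css = parse_inline_style(style)
    if css.get("vertical-align") == "super":
        return True
    return css.get("position") == "relative" and css.get("top", "").startswith("-")


def remove_styled_superscripts(soup):
    """Remove real <sup> tags and any tag whose inline style looks like one.

    `soup` is expected to be a BeautifulSoup object (or a Tag).
    """
    for element in soup.find_all("sup"):
        element.extract()
    for element in soup.find_all(lambda e: looks_like_superscript(e.get("style", ""))):
        element.extract()
```

New cases can then be handled by adding a rule to looks_like_superscript, without touching the code that walks the tree.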

Answer 2

Something like this:

fonts = soup.select('font[style*="position:relative"]')
for font in fonts:
    font.decompose()
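To see the selector in action, here is a small self-contained run against the font tag quoted in Answer 1. The surrounding paragraph and the word "increased" are made up for illustration and are not taken from the SEC filing. Note that the [style*="position:relative"] match is a plain substring test, so a style written as "position: relative" (with a space) would presumably not be caught by it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the <font> tag itself comes from the answer above.
html = (
    '<p>Net revenues<font size="1" style="font-size:6.5pt;font-weight:bold;'
    'position:relative;top:-3.0pt;">1</font> increased</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Select every <font> whose style attribute contains "position:relative"
# and drop it from the tree together with its contents.
for font in soup.select('font[style*="position:relative"]'):
    font.decompose()

text = soup.get_text()
print(text)
```

decompose() destroys the removed tag, while extract() (used in Answer 1) returns it, which can be handy if you want to log what was stripped.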

Beautifulsoup superscript is not "sup"
