Question

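I fetch the first paragraph from a page and try to extract words that would work as tags or keywords. Some paragraphs contain links, and I want to remove the tags:

For example, if the text is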

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

and end up with this

A hex triplet is a six-digit, three-byte ...

A regular expression like this does not work:

>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    enter code heretitle="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>

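What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with an example using BeautifulSoup's extract() (extract() removes the tag together with the text it contains, and it has to be run separately for each tag):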

>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" enter code heretitle="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" enter code heretitle="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>>

Update

For anyone with the same problem: as Brendan Long pointed out, this answer using HtmlParser works best.
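The answer referred to above is not reproduced in this thread. As a rough sketch of the general idea only, assuming Python 3 and the standard library's html.parser module, a parser subclass can collect the text nodes and ignore every tag. The names TagStripper and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_text()


text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ..."""
print(strip_tags(text))
```

Because the tags are handled by a real parser, a newline or a quoted ">" inside an attribute does not confuse it, which is where simple regular expressions tend to go wrong.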

Answer 1

Beautiful Soup is the answer to your problem! Give it a try, it's pretty awesome!

HTML parsing becomes so easy once you start using it.

>>> from bs4 import BeautifulSoup
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a> ..."""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> ''.join(soup.find_all(string=True))
'A hex triplet is a six-digit, three-byte ...'

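If all the text you want to extract sits inside some outer tag, such as <body> ... </body> or some <div id="X"> .... </div>, you can do the following (this illustration assumes that all the text you want to extract is enclosed in the <body> tag). That way you can selectively extract text from just the tags you want.

(Look through the documentation and examples; you will find many ways to work with the DOM.)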

>>> from bs4 import BeautifulSoup
>>> text = """<body>A <b>hex triplet</b> is a six-digit, 
... three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> ''.join(soup.body.find_all(string=True))
'A hex triplet is a six-digit, \nthree-byte\n'

Answer 2

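The + quantifier is greedy, meaning it finds the longest possible match. Add a ? after it to force it to find the shortest possible match instead: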

>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']

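Another way to write the regex is to use [^>] instead of . so that right angle brackets are explicitly excluded from the inside of a tag.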

>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

An advantage of this approach is that it also matches newlines (\n). You can get the same behavior with . if you add the re.DOTALL flag.

>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

To remove the tags, use re.sub:

>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
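The pieces above can be wrapped into a small reusable function. This is only a sketch: TAG_RE and remove_tags are names made up for illustration, and the second call shows a known weakness of the [^>] approach, namely that a ">" inside a quoted attribute value ends the match too early.

```python
import re

# "<", then one or more characters that are not ">", then ">".
# [^>] also matches newlines, so no re.DOTALL flag is needed.
TAG_RE = re.compile(r'<[^>]+>')


def remove_tags(text):
    """Delete everything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub('', text)


print(remove_tags('A <b>hex triplet</b> is a six-digit value'))

# The ">" inside the title attribute cuts the first match short,
# so part of the tag is left behind in the result.
print(repr(remove_tags('<a title="x > y">link</a>')))
```

For well-behaved fragments like the paragraph in the question this is usually enough; for arbitrary pages a real HTML parser is the safer choice.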

Answer 3

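This is just the bare bones of stripping tags. It includes the elements that tend to get missed. The \w below stands in for fully qualified Unicode tag names with a prefix and a body, which would need a join() statement to build the subexpressions. The upside of parsing html/xml with regular expressions is that it does not fail on the first malformed instance, which makes it well suited to repairing such markup. The downside is that it is slow, especially with Unicode.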

Unfortunately, stripping markup destroys content, because markup by definition formats content.
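A small illustration of that point, as a sketch using a simple tag pattern rather than the larger expression from this answer: deleting tags outright glues together words that the markup had kept apart, which matters when the goal is to extract keywords. The sample HTML string is made up.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

html = '<h1>Hex triplet</h1><p>Six digits</p><ul><li>red</li><li>green</li></ul>'

# Deleting the tags removes the only thing separating the words
glued = TAG_RE.sub('', html)

# Replacing each tag with a space keeps the words apart;
# split() then discards the surplus whitespace
words = TAG_RE.sub(' ', html).split()

print(glued)
print(words)
```

The trade-off is that inline tags are treated like block tags, so "three-<a>byte</a>" would come out as "three- byte". Which behavior is preferable depends on what the text is used for.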

Try this on a big web page. It should be translatable to Python.

$rx_expanded = '
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
';

$html =~ s/$rx_expanded/[was]/xsg;
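One possible Python translation of the Perl above, offered as a sketch rather than a tested port: re.VERBOSE stands in for /x, re.DOTALL for /s, and re.sub replaces every match the way /g does. The names TAG_RX and strip_markup are made up for illustration, and the default replacement is an empty string instead of the [was] marker.

```python
import re

TAG_RX = re.compile(r"""
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|'.*?'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|'.*?'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
""", re.VERBOSE | re.DOTALL)


def strip_markup(html, replacement=''):
    """Replace tags, comments, and script/style blocks (hypothetical helper)."""
    return TAG_RX.sub(replacement, html)


print(strip_markup('<p>Hello <b>world</b></p>'))
print(strip_markup('<p>Hello</p>', '[was]'))
```

As the answer warns, this is slow, and the nested quantifiers can backtrack heavily on long unterminated tags, so it is better suited to experiments than to production use.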

How do I remove HTML tags?