Question

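I have a Python dictionary containing HTML that I later want to parse with BeautifulSoup, but before parsing I want to remove the whitespace directly adjacent to tag elements.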

For example:

>>> string = "text <tag>some texts</tag> <tag> text</tag> some text"
>>> remove_whitespace(string)
'text<tag>some texts</tag><tag>text</tag>some text'

Answer 1

Assuming tags may have any name and never contain angle brackets themselves, a regular expression solves this quickly:

>>> import re
>>> string = "text <tag>some texts</tag> <tag> text</tag> some text"
>>> regex = re.compile(r"\s*(<[^<>]+>)\s*")
>>> regex.sub(r"\g<1>", string)
'text<tag>some texts</tag><tag>text</tag>some text'

Explanation

\s*     # Match any number of whitespace characters
(       # Match and capture in group 1:
 <      # - an opening angle bracket
 [^<>]+ # - one or more characters except angle brackets
 >      # - a closing angle bracket
)       # End of group 1 (used to restore the matched text later)
\s*     # Match any number of whitespace characters
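
The question calls a function named remove_whitespace and mentions a dictionary of HTML, so the regex above could be wrapped roughly as follows. This is only a minimal sketch: the dictionary name pages, its key, and its layout (plain HTML strings as values) are assumptions, since the question does not show the actual data.

```python
import re

# Whitespace, then one tag (captured), then whitespace.
# Same pattern as in the answer above; it assumes tags never contain < or >.
TAG_WITH_PADDING = re.compile(r"\s*(<[^<>]+>)\s*")


def remove_whitespace(html):
    """Return html with whitespace directly before and after each tag removed."""
    # r"\g<1>" puts back only the captured tag, dropping the surrounding whitespace.
    return TAG_WITH_PADDING.sub(r"\g<1>", html)


# Hypothetical dictionary layout: arbitrary keys mapped to HTML strings.
pages = {"example": "text <tag>some texts</tag> <tag> text</tag> some text"}

# Clean every value before handing it to the parser.
cleaned = {key: remove_whitespace(value) for key, value in pages.items()}

print(cleaned["example"])
# text<tag>some texts</tag><tag>text</tag>some text
```

Note that this also removes whitespace that matters for rendering, such as the spaces around inline elements or inside <pre> blocks, so it only fits cases where that loss is acceptable.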
