Question

I am writing a Python script that uses Beautiful Soup, and I need to get the opening tag from a string that contains some HTML code.

Here is my string:

string = "<p>...</p>"

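I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but can't seem to find a solution. Can anyone advise me on this?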

Answer 1

There is no direct way in BeautifulSoup to get the opening and closing parts of a tag, but you can at least get the tag's name:

>>> from bs4 import BeautifulSoup
>>> 
>>> html_content = """
... <body>
...     <p>test</p>
... </body>
...  """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p

With html.parser from the Python standard library, however, you can listen for "start" and "end" tag "events".
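
As a minimal sketch of that event-based idea: the class name TagRecorder, its two list attributes, and the class="intro" attribute in the sample input are made up for illustration. HTMLParser reports only the tag name for an end tag, so the closing tag below is rebuilt from the name rather than copied from the input.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: collect opening and closing tags as text."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the most recently opened start tag
        # exactly as it appeared in the input, attributes included.
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the lowercased tag name is passed in, so the end tag is
        # reconstructed here instead of being taken from the source text.
        self.closing_tags.append(f"</{tag}>")


parser = TagRecorder()
parser.feed('<p class="intro">...</p>')
parser.close()

opening_tag = parser.opening_tags[0]
closing_tag = parser.closing_tags[0]
print(opening_tag)  # <p class="intro">
print(closing_tag)  # </p>
```

This approach does not involve BeautifulSoup at all, which may or may not suit a script that already builds a soup object.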

Answer 2

There is a way to do this with BeautifulSoup and a simple regular expression:

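Put the paragraph in a BeautifulSoup object, e.g. soupParagraph.
Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g. soupInnerParagraph. (Because the contents are moved, they are not deleted.)
soupParagraph then holds only the opening and closing tags.
Convert soupParagraph to HTML text and store it in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from that string variable.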

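In general, parsing HTML with regular expressions is problematic and usually best avoided. Here, however, it may be reasonable.

The closing tag is simple. It has no attributes defined on it, and comments are not allowed inside it.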

Can I have attributes on closing tags?

HTML Comments inside Opening Tag of the Element
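
Since a closing tag carries nothing but the element name, one option is to rebuild it from Tag.name instead of extracting it. The snippet below is only a sketch: the sample markup and its class attribute are assumptions, and the built-in "html.parser" backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample input; the class attribute is made up for illustration.
soup = BeautifulSoup('<p class="intro">...</p>', "html.parser")
p = soup.p

# A closing tag has no attributes, so the element name is all that is needed.
closing_tag = f"</{p.name}>"
print(closing_tag)  # </p>
```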

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re

from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
# .append moves the HTML element from body to bodyInnerHtml, so body.contents
# shrinks by one each time; keep taking the first child until none remain.
while body.contents:
    bodyInnerHtml.append(body.contents[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*<\/body\s*>\s*$)\Z"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")
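
Applied to the <p> string from the question, the same idea might look like the sketch below. The function name split_tags and the sample attribute are hypothetical. Two deliberate differences from the code above: the children are dropped with Tag.clear() rather than moved, because the parsed copy is discarded anyway, and the regular expression captures the closing tag instead of only removing it.

```python
import re

from bs4 import BeautifulSoup


def split_tags(html, name):
    """Return (opening_tag, closing_tag) for the first <name> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(name)
    if element is None:
        raise ValueError(f"no <{name}> element found")

    # Drop the children so that only the opening and closing tags remain.
    element.clear()
    empty_html = element.decode()

    # Match the closing tag at the very end of the serialized element.
    pattern = r"(</" + re.escape(element.name) + r"\s*>)\s*\Z"
    match = re.search(pattern, empty_html)
    if match is None:
        # Void elements such as <br> are serialized without a closing tag.
        raise ValueError(f"no closing </{name}> tag found")

    return empty_html[:match.start()], match.group(1)


opening_tag, closing_tag = split_tags('<p class="intro">...</p>', "p")
print(opening_tag)  # <p class="intro">
print(closing_tag)  # </p>
```

Note that the opening tag comes back as BeautifulSoup serializes it, so quoting and whitespace inside the tag can differ from the original input.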

How do I get the opening and closing tags from an HTML string with Beautiful Soup?
