Searching for HTML elements based on a specific word in the element's string

Time: 2017-05-26 22:28:42

Tags: python beautifulsoup

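I'm trying to write a program that uses the Beautiful Soup module to find certain elements and replace their tags. However, I can't figure out how to "find" those elements by searching for a specific word in the element's string. Assuming I can get my code to find these elements by the specified string, I would then "unwrap" each element's "p" tag and "wrap" the content in a new "h1" tag.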

Here is some sample HTML as input:

<p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
<p> Example#2  this element ignored </p>
<p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>

Here is my code so far (searching by "ExampleStringWord#1"):

for h1_tag in soup.find_all(string="ExampleStringWord#1"):
            soup.p.wrap(soup.h1_tag("h1"))

Given the sample HTML input above, I would like the output to be:

<h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1>
<p> Example#2  this element ignored </p>
<h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>

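However, my code only finds elements whose string is exactly "ExampleStringWord#1" and excludes any element that has text after it. I'm sure I will somehow need to use a regular expression to find the elements that contain my specified word, regardless of the wording that follows it. However, I'm not very familiar with regular expressions, so I'm not sure how to use them with the BeautifulSoup module.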
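The difference between the two kinds of filter can be reproduced in isolation. The following is a minimal sketch, assuming a made-up two-paragraph input (not the sample above): a plain string filter only matches a text node that equals the whole string, while a compiled pattern is applied with its search() method, so a substring is enough.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical input for illustration only: the first paragraph has extra
# wording after the word, the second contains nothing but the word.
html = "<p>ExampleStringWord#1 and more text</p><p>ExampleStringWord#1</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the entire text node to match.
exact = soup.find_all(string="ExampleStringWord#1")

# A compiled pattern matches any text node that contains the word.
partial = soup.find_all(string=re.compile("ExampleStringWord#1"))

print(len(exact), len(partial))  # 1 2
```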

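Also, I looked at the Beautiful Soup documentation on passing a regular expression as a filter (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression), but I couldn't get it to work in my case. I also reviewed other posts about passing regular expressions to Beautiful Soup, but I didn't find anything that adequately addressed my problem. Any help is appreciated!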

1 Answer:

Answer 0: (score: 2)

What if you find the p elements that contain the specified substring (note the re.compile() part) and then change each element's name to h1:

import re

from bs4 import BeautifulSoup

data = """
<body>
    <p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
    <p> Example#2  this element ignored </p>
    <p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>
</body>
"""

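soup = BeautifulSoup(data, "html.parser")
# re.compile() makes this a substring search instead of an exact match
for p in soup.find_all("p", string=re.compile("ExampleStringWord#1")):
    # renaming the tag in place turns <p>...</p> into <h1>...</h1>
    p.name = 'h1'
print(soup)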

Prints:

<body>
    <h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1>
    <p> Example#2  this element ignored </p>
    <h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>
</body>
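
Two caveats may be worth handling. The pattern above matches the word anywhere in the text, not only as the first word, and string= compares against the tag's .string, which is None when a p has more than one child (for example nested markup), so such paragraphs are skipped. A minimal sketch of one way around both, assuming a hypothetical helper named rename_by_first_word and a made-up input; it passes a function filter and tests get_text() instead:

```python
import re

from bs4 import BeautifulSoup


def rename_by_first_word(soup, word, old="p", new="h1"):
    """Rename each <old> tag whose text starts with `word` to <new>.

    Hypothetical helper for illustration; returns how many tags were renamed.
    """
    # Optional leading whitespace, the literal word, then whitespace or end.
    pattern = re.compile(r"\s*" + re.escape(word) + r"(?!\S)")

    def starts_with_word(tag):
        # get_text() joins all nested strings, so child tags do not matter.
        return tag.name == old and pattern.match(tag.get_text()) is not None

    matched = soup.find_all(starts_with_word)
    for tag in matched:
        tag.name = new
    return len(matched)


# Made-up input: only the first and third paragraphs should be renamed.
html = (
    "<p> ExampleStringWord#1 plain text </p>"
    "<p> Example#2 ignored </p>"
    "<p> ExampleStringWord#1 with <b>nested</b> markup </p>"
    "<p> see ExampleStringWord#1 later in the sentence </p>"
)
soup = BeautifulSoup(html, "html.parser")
count = rename_by_first_word(soup, "ExampleStringWord#1")
print(count)  # 2
print(soup)
```

re.escape() is used so that characters in the search word are always treated literally, whatever word is passed in.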