How to convert a str to a BeautifulSoup tag

Time: 2021-07-05 12:01:44

Tags: python html beautifulsoup

I have two HTML documents like these:

<h3>
First heading 
</h3>
<ol>
<li>
hi
</li>
</ol>
<h3>
Second 
</h3>
<ol>
<li>
second
</li>
</ol>

Document 2:

<h3>
First heading 
</h3>
<ol>
<li>
hello
</li>
</ol>
<h3>
Second 
</h3>
<ol>
<li>
second to second
</li>
</ol>

I need to append the li items from the second document to the first document's HTML, under the matching h3. Here is my code:

soup = BeautifulSoup(html_string)
h3_tags = soup.find_all('h3')
ol_tags = [each_h3.find_next('ol') for each_h3 in h3_tags]


soup = BeautifulSoup(html_string_new)
h3_tags_new = soup.find_all('h3')
ol_tags_new = [each_h3.find_next('ol') for each_h3 in h3_tags_new]

countries_old = []
countries_new = []
html_new = ""
for i in h3_tags:
    countries_old.append(i.text)
for i in h3_tags_new:
    countries_new.append(i.text)

for country in countries_new:
    idx = countries_old.index(country)
    tag = str(ol_tags[idx])

    tag = tag[:-5]
    tag = tag[4:]
    idx_new = countries_new.index(country)
    tag_new = str(ol_tags_new[idx_new])

    tag_new = tag_new[:-5]
    tag_new = tag_new[4:]
    tag = "<ol>" + tag + tag_new + "</ol>"
    ol_tags[idx] = tag

    html_new += h3_tags[idx]
    html_new += tag


with open("check.html", "w", encoding="utf8") as html_file:
    html_file.write(html_new)
    html_file.close()

import pypandoc
output = pypandoc.convert(source='check.html', format='html', to='docx', outputfile='test.docx', extra_args=["-M2GB", "+RTS", "-K64m", "-RTS"])

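This code takes each h3 from the second document, looks up its index in the first document, and gets the ol that follows the h3 in each document. It then strips the ol tags from both and concatenates their contents. It keeps accumulating the result in html_new, which is later written to the HTML file. But when I concatenate the h3 with the ol, I get this error: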

TypeError: can only concatenate str (not "Tag") to str

Edit: expected output:

<h3>
First heading 
</h3>
<ol>
<li>
hello
</li>
<li>
hi
</li>
</ol>
<h3>
Second 
</h3>
<ol>
<li>
second to second
</li>
<li>
second
</li>
</ol>

1 answer:

Answer 0 (score: 0)

Try:

from bs4 import BeautifulSoup

html1 = """
<h3>
First heading 
</h3>
<ol>
<li>
hi
</li>
</ol>
<h3>
Second 
</h3>
<ol>
<li>
second
</li>
</ol>
"""

html2 = """
<h3>
First heading 
</h3>
<ol>
<li>
hello
</li>
</ol>
<h3>
Second 
</h3>
<ol>
<li>
second to second
</li>
</ol>
"""

soup1 = BeautifulSoup(html1, "html.parser")
soup2 = BeautifulSoup(html2, "html.parser")

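# Move each <li> from the second document into the <ol> that follows
# the matching <h3> in the first document.
for li in soup2.select("h3 + ol > li"):
    h3_text = li.find_previous("h3").get_text(strip=True)
    # `string=` replaces the deprecated `text=` argument; guard against None.
    h3_soup1 = soup1.find("h3", string=lambda t: t and h3_text in t)
    if not h3_soup1:
        continue
    # insert(0, ...) puts the new item first, matching the expected output.
    h3_soup1.find_next("ol").insert(0, li)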

print(soup1.prettify())
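
As for the TypeError itself: h3_tags[idx] is a Tag while html_new is a str, so the Tag has to be converted with str() before it can be concatenated. Going the other way, a string becomes a Tag again by parsing it. Below is a minimal sketch of both directions, assuming the html.parser backend; the sample fragment and the names li_tag and result are illustrative and not taken from the question.

```python
from bs4 import BeautifulSoup

# Illustrative fragment (not the question's actual documents).
soup = BeautifulSoup("<h3>First heading</h3><ol><li>hi</li></ol>", "html.parser")
h3 = soup.find("h3")

# Tag -> str: str() serializes the tag, so it can be concatenated.
html_new = ""
html_new += str(h3)

# str -> Tag: parse the string and take the element from the new soup.
li_tag = BeautifulSoup("<li>hello</li>", "html.parser").li

# A Tag can be inserted directly, so no string slicing is needed.
soup.find("ol").insert(0, li_tag)
result = str(soup)
```

Working with Tag objects throughout, as in the loop above, avoids slicing the "<ol>" and "</ol>" markers off by character count, which breaks as soon as the tag carries attributes.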