I have some HTML data, and I only want to extract the text that is set in a bold font.
<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off
<br></span></div>
I only want the text whose font family is ABCDEE+Cambria,Bold.
import re
from bs4 import BeautifulSoup

with open('/home/output4.html') as file:
    text = file.read()
soup = BeautifulSoup(text, 'html.parser')
x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE+Cambria,Bold.*'))
for rows in x:
    print(rows.text)
I have tried this, but I get an empty list.
Answer 0 (score: 0)
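+ is a special character in regular expressions (a quantifier meaning "one or more of the preceding item"), so you should escape it: write \+ rather than +.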
Example:
from bs4 import BeautifulSoup
import re
text = """
<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off
<br></span></div>
"""
soup = BeautifulSoup(text, 'html.parser')
x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE\+Cambria,Bold.*'))
for rows in x:
    print(rows.text)
Output:
Pinecone Functions
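To see why the unescaped pattern finds nothing, it may help to test the patterns against the style string directly, without BeautifulSoup. The sketch below uses only the standard re module. The variable names are illustrative, and re.escape is shown as one possible alternative to escaping by hand, not as something the original answer proposed.

```python
import re

# The style attribute value from the first <span> in the question.
style = "font-family: ABCDEE+Cambria,Bold; font-size:9px"

# Unescaped: "E+" is read as "one or more E", so the pattern expects
# "Cambria" right after the run of E's and never matches the literal "+".
unescaped = re.compile(r"font-family: ABCDEE+Cambria,Bold")

# Escaped by hand: "\+" matches a literal plus sign.
escaped = re.compile(r"font-family: ABCDEE\+Cambria,Bold")

# Escaped programmatically: re.escape backslash-escapes the special characters.
auto = re.compile(re.escape("font-family: ABCDEE+Cambria,Bold"))

print(unescaped.search(style))      # None
print(bool(escaped.search(style)))  # True
print(bool(auto.search(style)))     # True
```

A pattern built with re.escape can be passed to find_all as the style argument in the same way as the hand-escaped one, which avoids having to remember which characters are special.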