Extracting text from HTML based on font-family type

Time: 2019-07-01 10:26:53

Tags: python html regex beautifulsoup html-parsing

I have some HTML data, and I want to extract only the text set in a bold font type.

<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off 
<br></span></div>

I only want the text under the font family ABCDEE+Cambria,Bold.

import re
from bs4 import BeautifulSoup

with open('/home/output4.html') as file:
    text = file.read()

soup = BeautifulSoup(text, 'html.parser')

x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE+Cambria,Bold.*'))
for rows in x:
    print(rows.text)

I have tried this, but I get an empty list.

1 Answer:

Answer 0: (score: 0)

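+ is a special character in regular expressions: it means "one or more of the preceding element", so E+ matches a run of E characters and never the literal plus sign in the style value. You should escape it (note \+ rather than +).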
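To see why the original pattern returns nothing, here is a minimal sketch that uses only the standard re module and the style string from the question. The variable names are just for illustration.

```python
import re

# Style attribute value taken from the question's first <span>.
style = "font-family: ABCDEE+Cambria,Bold; font-size:9px"

# Unescaped: "E+" means "one or more E", so after the run of E's the
# pattern expects "C" but finds the literal "+" and fails to match.
unescaped = re.compile(r"font-family: ABCDEE+Cambria,Bold")

# Escaped: "\+" matches a literal plus sign.
escaped = re.compile(r"font-family: ABCDEE\+Cambria,Bold")

unescaped_match = unescaped.search(style)
escaped_match = escaped.search(style)

print(unescaped_match)             # None
print(escaped_match is not None)   # True
```

The same search happens when a compiled pattern is passed as an attribute filter to find_all, which is why the unescaped version yields an empty list.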

Example:

from bs4 import BeautifulSoup
import re

text = """
<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off 
<br></span></div>
"""

soup = BeautifulSoup(text, 'html.parser')

x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE\+Cambria,Bold.*'))
for rows in x:
    print(rows.text)

Output:

  

Pinecone Functions
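If you would rather not escape special characters by hand, re.escape can build the pattern from the plain font name. The sketch below assumes a hypothetical helper, font_family_pattern, which is not part of the original answer, and checks it against the two style values from the question.

```python
import re


def font_family_pattern(font_name):
    """Hypothetical helper: compile a regex for a literal font-family value."""
    # re.escape backslash-escapes regex metacharacters such as "+",
    # so the font name is matched character for character.
    return re.compile(r"font-family:\s*" + re.escape(font_name))


pattern = font_family_pattern("ABCDEE+Cambria,Bold")

# Style values copied from the two <span> elements in the question.
styles = [
    "font-family: ABCDEE+Cambria,Bold; font-size:9px",
    "font-family: ABCDEE+Calibri,Bold; font-size:7px",
]

# Keep only the styles whose font-family matches the requested font.
matches = [s for s in styles if pattern.search(s)]
print(matches)
```

The compiled pattern can then be passed to BeautifulSoup in the same way as above, for example soup.find_all('span', style=pattern).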