I have the following HTML file stored on my local system:
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4
<br>• six txt5
<br></span>
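I need to extract all the font sizes that appear in this HTML file. I am using BeautifulSoup, but I only know how to extract the text.
I can extract the text with the following code: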
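from bs4 import BeautifulSoup
# Open the file in a with block so it is closed again, and name the parser explicitly.
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')
texts = soup.find_all(string=True)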
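I need to extract the font size of each piece of text and store the font-size/text pairs in a list or array. Specifically, I would like a data structure such as [('One', '30'), ('Two', '15')], where 30 comes from font-size:30px and 15 comes from font-size:15px.
The only problem is that I cannot find a way to get the font-size values. Any ideas?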
Answer 0 (score: 1)
Hope this helps. I also suggest reading the BeautifulSoup documentation.
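import re
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed.
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')
# Keep only the spans that mention font-size.
font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for i in font_spans:
    # Capture whatever sits between 'font-size:' and 'px' in the style attribute.
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', str(i.get('style'))).group(2)
    tup = (str(i.text).strip(), fonts_size.strip())
    output.append(tup)
print(output)
[('One', '30'), ('Two', '15'), ....]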
If you want to drop the text values that contain txt, you can add the check if 'txt' not in i.text: inside the loop.
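As a small illustration of that filter, here is a sketch that applies the same test to a list of pairs after the fact. The sample pairs are hand-copied stand-ins for what the loop would produce, and the name `filtered` is only illustrative.

```python
# Sample (text, size) pairs of the kind the loop above produces.
output = [('One', '30'), ('Two', '15'), (': two txt', '16'), ('Three', '15')]

# Keep only the pairs whose text does not contain 'txt'.
filtered = [(text, size) for text, size in output if 'txt' not in text]

print(filtered)  # [('One', '30'), ('Two', '15'), ('Three', '15')]
```

Note that this is a plain substring test, so it would also drop any heading that happened to contain the letters txt.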
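Explanation:
First, you need to identify the spans that contain font-size:
font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
Then you iterate over font_spans and extract the font size and the text value from each one:
textvalue = i.text # One,Two..
fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..
Finally, you build a list that holds every result as a tuple.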
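The three steps can be put together in one self-contained sketch. It is not the answer's exact code: the inline `html` string is a shortened stand-in for the file in the question, the names `FONT_SIZE` and `font_text_pairs` are made up for illustration, and the check is done on each span's own style attribute so a span without font-size is skipped rather than raising an error.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML file in the question (assumption).
html = (
    '<span style="position:absolute; left:0px;"></span>'
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One<br></span></div>'
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span>'
    '<span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt<br></span></div>'
)

# Digits between 'font-size:' and 'px'.
FONT_SIZE = re.compile(r'font-size:\s*(\d+)px')


def font_text_pairs(markup):
    """Return a list of (text, font size) tuples, one per span with a font-size."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for span in soup.find_all('span'):
        # Step 1: look only at spans whose own style declares a font-size.
        match = FONT_SIZE.search(span.get('style', ''))
        if match is None:
            continue
        # Steps 2 and 3: pull out text and size, then store them as a tuple.
        pairs.append((span.get_text().strip(), match.group(1)))
    return pairs


print(font_text_pairs(html))  # [('One', '30'), ('Two', '15'), (': two txt', '16')]
```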
Answer 1 (score: 1)
You can use the CSS selector select("[style*=font-size]") to find the tags whose style attribute contains font-size, and then use a regular expression to extract the value:
In [12]: from bs4 import BeautifulSoup
In [13]: import re
In [14]: soup = BeautifulSoup(html, "html.parser")
In [15]: patt = re.compile(r"font-size:(\d+)")
In [16]: [(tag.text.strip(), patt.search(tag["style"]).group(1)) for tag in soup.select("[style*=font-size]")]
Out[16]:
[('One', '30'),
('Two', '15'),
(': two txt', '16'),
('Three', '15'),
(': Three txt', '16'),
('Four', '15'),
(': Four txt', '16'),
('FIVE', '19'),
('five txt\nfive txt2\nfive txt3', '18'),
('SIX', '19'),
('six txt', '17'),
('six txt2\n- six txt2\n• six txt3\n• six txt4\n• six txt5', '18')]
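Some of the pairs above hold several lines of text under a single size. If one pair per line is preferred, and the sizes are wanted as integers, a post-processing step along these lines might do. This is only a sketch: `pairs` is a hand-copied excerpt of the output above, and the name `flat` is illustrative.

```python
# Excerpt of the (text, size) pairs shown in the output above.
pairs = [('FIVE', '19'), ('five txt\nfive txt2\nfive txt3', '18')]

# Emit one (line, size) pair per non-empty line and convert the size to int.
flat = [
    (line.strip(), int(size))
    for text, size in pairs
    for line in text.splitlines()
    if line.strip()
]

print(flat)  # [('FIVE', 19), ('five txt', 18), ('five txt2', 18), ('five txt3', 18)]
```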
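Answer 2 (score: 0)
You will have to do some research of your own; the Beautiful Soup documentation and the regex docs are what you should read to understand how things fit together.
See the example below, which uses a regular expression to find the first occurrence of font-size and then splits the string to keep only the number of pixels.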
from bs4 import BeautifulSoup as Soup
from bs4 import Tag
import re
data = """
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;">
<span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4
<br>• six txt5
<br></span>
"""
soup = Soup(data, 'html.parser')
def get_the_start_of_font(attr):
    """ Return the index of the 'font-size' first occurrence or None. """
    match = re.search(r'font-size:', attr)
    if match is not None:
        return match.start()
    return None
def get_font_size_from(attr):
    """ Return the font size as string or None if not found. """
    font_start_i = get_the_start_of_font(attr)
    if font_start_i is not None:
        return str(attr[font_start_i + len('font-size:'):].split('px')[0])
    return None
# iterate through all descendants:
fonts = []
for child in soup.descendants:
    if isinstance(child, Tag) and child.get('style') is not None:
        font = get_font_size_from(child.get('style'))
        if font is not None:
            fonts.append([
                str(child.text).strip(), font])
print(fonts)
The solution can be improved, but this is a working example.
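One possible improvement, sketched here rather than taken from the answer, is to split the style attribute into its individual declarations instead of searching it as one string. The helper name `parse_style` is hypothetical, and the sketch assumes simple `property: value;` lists like the ones in the question; a value that itself contains a semicolon would need a real CSS parser.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue  # skip empty fragments such as the one after a trailing ';'
        name, _, value = part.partition(':')
        declarations[name.strip().lower()] = value.strip()
    return declarations


style = 'font-family: BAAAAA+DejaVuSans-Bold; font-size:30px'
size = parse_style(style).get('font-size')
print(size)  # 30px

# Drop the unit to get just the number, if the value ends in 'px'.
number = size[:-2] if size.endswith('px') else size
print(number)  # 30
```

With the declarations in a dict, other properties such as font-family can be read the same way, and a tag without font-size simply yields None from .get().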