Need to extract all font sizes and text using BeautifulSoup

Time: 2016-08-18 07:57:18

Tags: python html fonts beautifulsoup

I have the following HTML file stored on my local system:

<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2 
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4 
<br>• six txt5
<br></span>

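I need to extract every font size that appears in this HTML file. I am using BeautifulSoup, but I only know how to extract the text.

I can extract the text with the following code: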

from bs4 import BeautifulSoup
htmlData = open('/home/usr/Downloads/files/output.html', 'r')
soup = BeautifulSoup(htmlData)

texts = soup.findAll(text=True)

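I need to extract the font size of each piece of text and store the font-size/text pairs in a list or array. Specifically, I would like a data structure such as [('One','30'),('Two','15')], where 30 comes from font-size:30px and 15 comes from font-size:15px.

The only problem is that I cannot find a way to get the font-size value. Any ideas?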

3 Answers:

Answer 0: (score: 1)

Hope this helps. I suggest you read more of the BeautifulSoup documentation.

from bs4 import BeautifulSoup
import re  # needed for re.search below

htmlData = open('/home/usr/Downloads/files/output.html', 'r')
soup = BeautifulSoup(htmlData, 'html.parser')  # name the parser explicitly

# keep only the spans whose markup mentions font-size
font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
output = []
for i in font_spans:
    # group(2) is the number between "font-size:" and "px"
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2)
    tup = (str(i.text).strip(),fonts_size.strip())
    output.append(tup)

print(output)
[('One', '30'),('Two', '15'), ....]

If you want to drop the text values that contain txt, you can add if not 'txt' in i.text:
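As a rough illustration of that filter, here is a minimal sketch that applies the same test to a list of pairs that has already been built. The `pairs` list is a hypothetical sample shaped like `output` above, not the real result of parsing the file.

```python
# Hypothetical sample of (text, font_size) pairs, shaped like `output` above.
pairs = [('One', '30'), ('Two', '15'), (': two txt', '16'),
         ('FIVE', '19'), ('five txt', '18')]

# Keep only the pairs whose text does not contain the substring 'txt'.
filtered = [(text, size) for text, size in pairs if 'txt' not in text]

print(filtered)
```

Inside the original loop, the same condition would go in front of `output.append(tup)`.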

Explanation:

First, you need to identify the tags that contain font-size:

font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]

Then you need to iterate over font_spans and extract the font size and the text value:

textvalue = i.text # One,Two..
fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..

Finally, you need to build a list that holds all of the output as tuples.
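The three steps can also be folded into one small function. The sketch below is one possible arrangement, not the answer's exact code: the function name `extract_font_pairs`, the `FONT_SIZE_RE` pattern, and the `sample` markup are all made up for illustration, and the pattern assumes an integer pixel value. Unlike the loop above, it skips spans where no font-size is found instead of raising an error.

```python
import re
from bs4 import BeautifulSoup

# Assumed pattern: an integer pixel value right after "font-size:".
FONT_SIZE_RE = re.compile(r'font-size:\s*(\d+)px')

def extract_font_pairs(html):
    """Return (text, font_size) tuples for every span that declares a font-size."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for span in soup.find_all('span'):
        # Step 1: look for a font-size declaration in the style attribute.
        match = FONT_SIZE_RE.search(span.get('style', ''))
        if match is None:
            continue  # this span has no font-size, skip it
        # Steps 2 and 3: pair the stripped text with the size and collect it.
        pairs.append((span.get_text().strip(), match.group(1)))
    return pairs

# Hypothetical, shortened markup in the style of the question's file.
sample = ('<span style="left:0px"></span>'
          '<span style="font-family: A; font-size:30px">One<br></span>'
          '<span style="font-family: B; font-size:15px">Two</span>')
print(extract_font_pairs(sample))
```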

Answer 1: (score: 1)

You can use the CSS selector select("[style*=font-size]") to pick out the tags whose style attribute contains font-size, and use a regular expression to extract the value:

In [12]: from bs4 import BeautifulSoup

In [13]: import re

In [14]: soup = BeautifulSoup(html, "html.parser")

In [15]: patt = re.compile(r"font-size:(\d+)")

In [16]: [(tag.text.strip(), patt.search(tag["style"]).group(1)) for tag in soup.select("[style*=font-size]")]
Out[16]: 
[('One', '30'),
 ('Two', '15'),
 (': two txt', '16'),
 ('Three', '15'),
 (': Three txt', '16'),
 ('Four', '15'),
 (': Four txt', '16'),
 ('FIVE', '19'),
 ('five txt\nfive txt2\nfive txt3', '18'),
 ('SIX', '19'),
 ('six txt', '17'),
 ('six txt2\n- six txt2\n• six txt3\n• six txt4\n• six txt5', '18')]
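Since the question also asks for every font size that appears in the file, a list like the one above can be reduced to the distinct sizes. This is a minimal sketch under stated assumptions: `pairs` is a hypothetical, shortened copy of that output, and grouping by size is only one of several reasonable ways to organise it.

```python
from collections import defaultdict

# Hypothetical, shortened copy of the (text, font_size) output above.
pairs = [('One', '30'), ('Two', '15'), (': two txt', '16'),
         ('Three', '15'), ('FIVE', '19')]

# Group the texts by numeric font size.
texts_by_size = defaultdict(list)
for text, size in pairs:
    texts_by_size[int(size)].append(text)

# The dictionary keys are the distinct sizes that occur.
distinct_sizes = sorted(texts_by_size)
print(distinct_sizes)
print(texts_by_size[15])
```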

Answer 2: (score: 0)

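You will have to do some research of your own; the Beautiful Soup documentation and the regex docs are what you should read to understand how things fit together.

Take a look at the following example. It uses a regular expression to find the first occurrence of font-size, then splits the rest of the string to keep only the pixel count.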

from bs4 import BeautifulSoup as Soup
from bs4 import Tag
import re

data = """
  <span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
  <div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
  <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;">
    <span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
    <br></span>
  </div>
  <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
  <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
  <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
  <br>five txt2 
  <br>five txt3
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
  <br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
  <br>- six txt2
  <br> six txt3
  <br> six txt4 
  <br> six txt5
  <br></span>
"""
soup = Soup(data, 'html.parser')

def get_the_start_of_font(attr):
  """ Return the index of the 'font-size' first occurrence or None. """
  match = re.search(r'font-size:', attr)
  if match is not None:
    return match.start()
  return None 

def get_font_size_from(attr):
  """ Return the font size as string or None if not found. """
  font_start_i = get_the_start_of_font(attr)
  if font_start_i is not None:
    return str(attr[font_start_i + len('font-size:'):].split('px')[0])
  return None

# iterate through all descendants:
fonts = []
for child in soup.descendants:
  if isinstance(child, Tag) is True and child.get('style') is not None:
    font = get_font_size_from(child.get('style'))
    if font is not None:
      fonts.append([
        str(child.text).strip(), font])

print(fonts)

The solution could be improved, but it is a working example.
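One possible improvement, sketched here as an assumption rather than as part of the original answer, is to split the style attribute into its individual declarations instead of searching for the substring and splitting on 'px'. The helper names `parse_style` and `font_size_px` are hypothetical. The sketch splits on ';' naively, so it would mishandle values that themselves contain a semicolon.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue  # empty or malformed fragment
        name, value = part.split(':', 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations

def font_size_px(style):
    """Return the font size in px as a string, or None if it is absent."""
    size = parse_style(style).get('font-size')
    if size is None or not size.endswith('px'):
        return None
    return size[:-2].strip()

print(font_size_px("font-family: BAAAAA+DejaVuSans-Bold; font-size:30px"))
print(font_size_px("position:absolute; top:50px;"))
```

With helpers like these, `get_font_size_from(child.get('style'))` in the loop above could be replaced by `font_size_px(child.get('style'))`.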