Extracting information from a local HTML file with Python

Time: 2015-06-29 20:25:23

Tags: python html beautifulsoup

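I have a local HTML page, and I want to extract the line that follows each occurrence of:

<font color='000000'> <u>PATTERN:</font>

Here is the page source. It is the output of the program ApproxMAP, hosted on Google Code: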

<! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
<HTML><font size=5 face='Helvetica-Narrow'><b>
<font color='000000'> Cluster Support= [Pattern=</font>
<font color='000000'> 50</font>
<font color='000000'> % : Variation=</font>
<font color='000000'> 20</font>
<font color='000000'> %]; Database Support= [Min= </font>
<font color='000000'> 1</font>
<font color='000000'>  seq: Max=</font>
<font color='000000'> 50</font>
<font color='000000'> %]</font>
<BR>
<font color='a9a9a9'> cluster=0 size=3</font>
<font color='000000'>   =<100:</font>
<font color='434343'> 85:</font>
<font color='767676'> 70:</font>
<font color='a9a9a9'> 50:</font>
<font color='c8c8c8'> 35:</font>
<font color='e1e1e1'> 20></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,} 
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='000000'> 1</font>
<font color='cbcbcb'> 12</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 24</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='7f7f7f'> 2</font>
<font color='7f7f7f'> 3</font>
<font color='cbcbcb'> 25</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 1</font>
<font color='7f7f7f'> 4</font>
<font color='7f7f7f'> 5</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 26</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,} 
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='717171'> 9</font>
<font color='989898'> 10</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='d3d3d3'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 12</font>
<font color='989898'> 13</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> TOTAL LEN=</font>
<font color='000000'> 10</font>
<BR>
<BR>
</b></font></html>

In this case, I want to extract the following:

{1,} {2,3,} {4,5,} 
{9,10,} {11,} {12,13,} 

Here is some of the code I tried; none of it worked:

# First try
soup = BeautifulSoup('file:///H:/Approx_google_code/tiny20.html')
soup.findall('PATTERN:')

# Second try
re.search( "PATTERN:", 'file:///H:/Approx_google_code/tiny20.html')

# Third try
soup.body.findAll(text='PATTERN:')

# Fourth try
soup.body.findAll(text=re.compile('PATTERN:'))

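I have been stuck on this simple problem for so long that I am starting to doubt whether BeautifulSoup is the right approach. I am completely new to HTML, so any simple explanation or suggestion is welcome. Thanks.

I tried the examples from "Why does bs4 return tags and then an empty list to this find_all() method?", but without success.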

1 answer:

Answer 0 (score: 1):

Find each element containing the text PATTERN:, go up to its font parent, and get the next font sibling element:

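from bs4 import BeautifulSoup

soup = BeautifulSoup(data)

# For each "PATTERN:" text node, go up to its <font> parent and print the text of the next <font> sibling
for elm in soup.find_all(text="PATTERN:"):
    print(elm.find_parent("font").find_next_sibling("font").get_text(strip=True))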
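
The first attempt in the question passes a file:/// URL string to BeautifulSoup, which parses that string itself as markup instead of opening the file. Below is a minimal sketch of reading the file first and then applying the same parent/sibling lookup. The helper names extract_patterns and extract_patterns_from_file are hypothetical, the UTF-8 encoding is an assumption about the file, and string= is the spelling that bs4 4.4.0 and later use for the older text= argument:

```python
from bs4 import BeautifulSoup


def extract_patterns(html):
    """Return the text of the <font> element that follows each PATTERN: label."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for label in soup.find_all(string="PATTERN:"):
        parent = label.find_parent("font")
        sibling = parent.find_next_sibling("font") if parent is not None else None
        if sibling is not None:
            results.append(sibling.get_text(strip=True))
    return results


def extract_patterns_from_file(path):
    """Read a local HTML file (encoding assumed to be UTF-8) and extract from it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return extract_patterns(handle.read())
```

For the file in the question, the call would look like extract_patterns_from_file("H:/Approx_google_code/tiny20.html"), using a plain filesystem path rather than a file:/// URL.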

Demo:

>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
... <HTML><font size=5 face='Helvetica-Narrow'><b>
... <font color='000000'> Cluster Support= [Pattern=</font>
... <font color='000000'> 50</font>
... <font color='000000'> % : Variation=</font>
... <font color='000000'> 20</font>
... <font color='000000'> %]; Database Support= [Min= </font>
... <font color='000000'> 1</font>
... <font color='000000'>  seq: Max=</font>
... <font color='000000'> 50</font>
... <font color='000000'> %]</font>
... <BR>
... <font color='a9a9a9'> cluster=0 size=3</font>
... <font color='000000'>   =<100:</font>
... <font color='434343'> 85:</font>
... <font color='767676'> 70:</font>
... <font color='a9a9a9'> 50:</font>
... <font color='c8c8c8'> 35:</font>
... <font color='e1e1e1'> 20></font>
... <BR>
... <font color='000000'> <u>PATTERN:</font>
... <font color='000000'> {1,} {2,3,} {4,5,}
... </font>
... <font color='000000'> =</font>
... <font color='000000'> 5</font>
... <font color='000000'> </u></font>
... <BR>
... <font color='000000'> {</font>
... <font color='000000'> 1</font>
... <font color='cbcbcb'> 12</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 24</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='7f7f7f'> 2</font>
... <font color='7f7f7f'> 3</font>
... <font color='cbcbcb'> 25</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 1</font>
... <font color='7f7f7f'> 4</font>
... <font color='7f7f7f'> 5</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 26</font>
... <font color='000000'> }</font>
... <BR>
... <font color='000000'> <u>PATTERN:</font>
... <font color='000000'> {9,10,} {11,} {12,13,}
... </font>
... <font color='000000'> =</font>
... <font color='000000'> 5</font>
... <font color='000000'> </u></font>
... <BR>
... <font color='000000'> {</font>
... <font color='717171'> 9</font>
... <font color='989898'> 10</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='d3d3d3'> 11</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='404040'> 11</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='404040'> 12</font>
... <font color='989898'> 13</font>
... <font color='000000'> }</font>
... <BR>
... <font color='000000'> TOTAL LEN=</font>
... <font color='000000'> 10</font>
... <BR>
... <BR>
... </b></font></html>
... """
>>> 
>>> soup = BeautifulSoup(data)
>>> 
>>> for elm in soup.find_all(text="PATTERN:"):
...     print(elm.find_parent("font").find_next_sibling("font").get_text(strip=True))
... 
{1,} {2,3,} {4,5,}
{9,10,} {11,} {12,13,}

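Note that because I have lxml installed, BeautifulSoup uses it as the underlying parser. I also tried html.parser, and it worked for me. html5lib behaves differently from the other two. In any case, specify the parser explicitly: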

soup = BeautifulSoup(data, "lxml")

Or:

soup = BeautifulSoup(data, "html.parser")
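
The markup here is malformed (the <u> opened before PATTERN: is only closed several elements later), so each parser may build a different tree. Below is a rough sketch for checking which installed parsers yield the expected lines. compare_parsers is a hypothetical helper, and parsers that are not installed are skipped by catching bs4's FeatureNotFound:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(html, parsers=("lxml", "html.parser", "html5lib")):
    """Map each available parser name to the pattern lines it yields."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        found = []
        for label in soup.find_all(string="PATTERN:"):
            parent = label.find_parent("font")
            sibling = parent.find_next_sibling("font") if parent is not None else None
            # None records that this parser's tree has no following <font> sibling
            found.append(sibling.get_text(strip=True) if sibling is not None else None)
        results[name] = found
    return results
```

Calling compare_parsers(data) and comparing the lists side by side shows whether a given parser is suitable before committing to it in the explicit BeautifulSoup(data, ...) call.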