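I have this local web page, and I want to extract every line that comes right after
<font color='000000'> <u>PATTERN:</font>
Below is the page source. It is the output of the program ApproxMAP, which is hosted on Google Code: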
<! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
<HTML><font size=5 face='Helvetica-Narrow'><b>
<font color='000000'> Cluster Support= [Pattern=</font>
<font color='000000'> 50</font>
<font color='000000'> % : Variation=</font>
<font color='000000'> 20</font>
<font color='000000'> %]; Database Support= [Min= </font>
<font color='000000'> 1</font>
<font color='000000'> seq: Max=</font>
<font color='000000'> 50</font>
<font color='000000'> %]</font>
<BR>
<font color='a9a9a9'> cluster=0 size=3</font>
<font color='000000'> =<100:</font>
<font color='434343'> 85:</font>
<font color='767676'> 70:</font>
<font color='a9a9a9'> 50:</font>
<font color='c8c8c8'> 35:</font>
<font color='e1e1e1'> 20></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='000000'> 1</font>
<font color='cbcbcb'> 12</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 24</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='7f7f7f'> 2</font>
<font color='7f7f7f'> 3</font>
<font color='cbcbcb'> 25</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 1</font>
<font color='7f7f7f'> 4</font>
<font color='7f7f7f'> 5</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 26</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='717171'> 9</font>
<font color='989898'> 10</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='d3d3d3'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 12</font>
<font color='989898'> 13</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> TOTAL LEN=</font>
<font color='000000'> 10</font>
<BR>
<BR>
</b></font></html>
In this case, I want to extract the following:
{1,} {2,3,} {4,5,}
{9,10,} {11,} {12,13,}
Here is some of the code I have tried, none of which works:
# First try
soup = BeautifulSoup('file:///H:/Approx_google_code/tiny20.html')
soup.findall('PATTERN:')
# Second try
re.search( "PATTERN:", 'file:///H:/Approx_google_code/tiny20.html')
# Third try
soup.body.findAll(text='PATTERN:')
# Fourth try
soup.body.findAll(text=re.compile('PATTERN:'))
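I have been stuck on this simple problem for so long that I am starting to wonder whether BeautifulSoup is the right approach at all. I am completely new to HTML, so any simple explanation or suggestion is welcome. Thanks.
I also tried the example from "Why does bs4 return tags and then an empty list to this find_all() method?", but it returned nothing.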
Answer 0 (score: 1)
Find the elements whose text is PATTERN:, go up to the parent font element, and take its next font sibling:
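from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
for elm in soup.find_all(text="PATTERN:"):  # newer bs4 releases also accept string=
    # the pattern itself sits in the <font> that follows the label's <font>
    print(elm.find_parent("font").find_next_sibling("font").get_text(strip=True))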
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
... <HTML><font size=5 face='Helvetica-Narrow'><b>
... <font color='000000'> Cluster Support= [Pattern=</font>
... <font color='000000'> 50</font>
... <font color='000000'> % : Variation=</font>
... <font color='000000'> 20</font>
... <font color='000000'> %]; Database Support= [Min= </font>
... <font color='000000'> 1</font>
... <font color='000000'> seq: Max=</font>
... <font color='000000'> 50</font>
... <font color='000000'> %]</font>
... <BR>
... <font color='a9a9a9'> cluster=0 size=3</font>
... <font color='000000'> =<100:</font>
... <font color='434343'> 85:</font>
... <font color='767676'> 70:</font>
... <font color='a9a9a9'> 50:</font>
... <font color='c8c8c8'> 35:</font>
... <font color='e1e1e1'> 20></font>
... <BR>
... <font color='000000'> <u>PATTERN:</font>
... <font color='000000'> {1,} {2,3,} {4,5,}
... </font>
... <font color='000000'> =</font>
... <font color='000000'> 5</font>
... <font color='000000'> </u></font>
... <BR>
... <font color='000000'> {</font>
... <font color='000000'> 1</font>
... <font color='cbcbcb'> 12</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 24</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='7f7f7f'> 2</font>
... <font color='7f7f7f'> 3</font>
... <font color='cbcbcb'> 25</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 1</font>
... <font color='7f7f7f'> 4</font>
... <font color='7f7f7f'> 5</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='cbcbcb'> 26</font>
... <font color='000000'> }</font>
... <BR>
... <font color='000000'> <u>PATTERN:</font>
... <font color='000000'> {9,10,} {11,} {12,13,}
... </font>
... <font color='000000'> =</font>
... <font color='000000'> 5</font>
... <font color='000000'> </u></font>
... <BR>
... <font color='000000'> {</font>
... <font color='717171'> 9</font>
... <font color='989898'> 10</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='d3d3d3'> 11</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='404040'> 11</font>
... <font color='000000'> }</font>
... <font color='000000'> {</font>
... <font color='404040'> 12</font>
... <font color='989898'> 13</font>
... <font color='000000'> }</font>
... <BR>
... <font color='000000'> TOTAL LEN=</font>
... <font color='000000'> 10</font>
... <BR>
... <BR>
... </b></font></html>
... """
>>>
>>> soup = BeautifulSoup(data)
>>>
>>> for elm in soup.find_all(text="PATTERN:"):
...     print(elm.find_parent("font").find_next_sibling("font").get_text(strip=True))
...
{1,} {2,3,} {4,5,}
{9,10,} {11,} {12,13,}
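The demo parses a string that already holds the HTML. The attempts in the question instead hand BeautifulSoup the file:/// URL itself, so the parser only ever sees that short string and never the page. The following is a minimal sketch of the same lookup applied to a file on disk, assuming Python 3 with bs4 installed. The function name `extract_patterns` and the shortened `SAMPLE_HTML` are made up for illustration, and `string=` is the newer spelling of the `text=` argument.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def extract_patterns(path):
    """Return the text of the <font> that follows each 'PATTERN:' label."""
    # Open the file and let BeautifulSoup read its contents. Passing the
    # path or URL string itself would parse that string, not the page.
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")

    patterns = []
    for label in soup.find_all(string="PATTERN:"):
        # Climb to the label's <font>, then step to the next <font> sibling.
        value = label.find_parent("font").find_next_sibling("font")
        if value is not None:
            patterns.append(value.get_text(strip=True))
    return patterns


# Shortened stand-in for the ApproxMAP page (hypothetical sample data).
SAMPLE_HTML = """<HTML><font size=5 face='Helvetica-Narrow'><b>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
</b></font></html>
"""

# Write the sample to a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(SAMPLE_HTML)

try:
    found = extract_patterns(tmp.name)
finally:
    os.remove(tmp.name)

for line in found:
    print(line)
```

For the page in the question, the call would presumably take a plain filesystem path such as "H:/Approx_google_code/tiny20.html" rather than the file:/// URL.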
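Note that since I have lxml installed, BeautifulSoup uses it as the underlying parser. I also tried html.parser, and it worked for me as well. html5lib behaves differently from the other two. In any case, specify the parser explicitly: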
soup = BeautifulSoup(data, "lxml")
or:
soup = BeautifulSoup(data, "html.parser")
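If the differences between parsers are a concern, the same lines can also be pulled straight from the raw text without building a tree. This is a different technique from the one above: a rough sketch using only the standard library `re` module. It assumes the generator always writes the label and the pattern as two consecutive font elements, exactly as in the page shown in the question. `PATTERN_RE`, `extract_patterns_regex` and `SAMPLE_HTML` are names invented for this example.

```python
import re

# Match the label's closing </font>, then capture the body of the very
# next <font> element, ignoring surrounding whitespace and newlines.
PATTERN_RE = re.compile(
    r"<u>PATTERN:</font>\s*<font[^>]*>\s*(.*?)\s*</font>",
    re.DOTALL,
)


def extract_patterns_regex(html):
    """Return the text that follows each 'PATTERN:' label in raw HTML."""
    return PATTERN_RE.findall(html)


# Shortened stand-in for the ApproxMAP page (hypothetical sample data).
SAMPLE_HTML = """<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
"""

regex_found = extract_patterns_regex(SAMPLE_HTML)
for line in regex_found:
    print(line)
```

A regular expression like this is tied to the exact layout the program emits, so it is best treated as a fallback for this one fixed format rather than a general way to read HTML.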