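Python: scraping from a complex HTML file

Time: 2014-07-04 07:43:59

Tags: python web-scraping beautifulsoup

I have a huge HTML file and want to scrape some information from it. It contains roughly 2000 variables, each with a set of values. I need the variable names and their values in this format: varname1,VAL1,VAL2,... varname2,VAL1,VAL2,... and so on.

The values appear in this format: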


<h2><span lang=EN-US>Element Values</span></h2>

<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>

<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
.
.
.

<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='mso-no-proof:yes'>20[true]</span></p>

<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
 style='font-size:6.0pt;mso-bidi-font-size:12.0pt'><o:p>&nbsp;</o:p></span></p>

<h2><span lang=EN-US>Element Notes</span></h2>

I need the values 01 = 01, 02, ..., 20[true].

The variable name always appears in this format:

<span style='mso-no-proof:yes'>2716</span>

That is, a 4-digit number inside a span tag.

So one line of output might be 2716,01 = 01,02,...,20[true]

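1 answer:

Answer 0 (score: 0)

This matches the required tags (span tags whose style attribute is set to mso-no-proof:yes); after that, it is just a matter of extracting the text.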

from bs4 import BeautifulSoup

html = """<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='mso-no-proof:yes'>20[true]</span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
 style='font-size:6.0pt;mso-bidi-font-size:12.0pt'><o:p>&nbsp;</o:p></span></p>
<h2><span lang=EN-US>Element Notes</span></h2>"""

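# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
# Match <span> tags whose style attribute is exactly 'mso-no-proof:yes'.
elements = soup.find_all(name='span', attrs={'style': 'mso-no-proof:yes'})
# Join the text of every matching span with commas.
print(','.join(e.text for e in elements))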

Output:

01 = 01,02,20[true]
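
The snippet above collects the values but does not pair them with a variable name, and because the name span carries the same style attribute it would be matched as well. The following is only a rough sketch of one way to produce one row per variable. It assumes a layout the question does not actually show: each 4-digit name span appears somewhere before its own "Element Values" heading, and the values end at the next h2. The SAMPLE markup, the second variable 2717, the h1 wrappers, and the helper name extract_rows are all made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample. How a variable name and its "Element Values" section
# sit together in the real file is an assumption, not taken from the question.
SAMPLE = """
<h1><span style='mso-no-proof:yes'>2716</span></h1>
<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
<h2><span lang=EN-US>Element Notes</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>ignored</span></p>
<h1><span style='mso-no-proof:yes'>2717</span></h1>
<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>20[true]</span></p>
<h2><span lang=EN-US>Element Notes</span></h2>
"""

FOUR_DIGITS = re.compile(r"\d{4}")


def extract_rows(html):
    """Return a list of [varname, val1, val2, ...] lists (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    current = None      # row for the variable currently being filled
    in_values = False   # True while between "Element Values" and the next h2

    # find_all returns tags in document order, so headings and spans interleave.
    for tag in soup.find_all(["h2", "span"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            in_values = (text == "Element Values")
            continue
        if tag.find_parent("h2") is not None:
            continue  # the span that wraps the heading text itself
        if "mso-no-proof:yes" not in tag.get("style", ""):
            continue  # spacer spans and other styling
        if not in_values and FOUR_DIGITS.fullmatch(text):
            current = [text]          # a new variable name starts a new row
            rows.append(current)
        elif in_values and current is not None and text:
            current.append(text)      # a value belonging to the current variable
    return rows


for row in extract_rows(SAMPLE):
    print(",".join(row))
# prints:
# 2716,01 = 01,02
# 2717,20[true]
```

If the real file arranges names and values differently, the two state variables (current and in_values) are the part to adapt. For the full 2000-variable file, the rows could be written with the csv module instead of joined by hand.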