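I have a huge HTML file and I want to scrape some information from it. Basically there are 2000 variables, each with some values. I need the variable names and their values in this format: varname1,VAL1,VAL2,... varname2,VAL1,VAL2,... and so on.
The values appear in this format: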
<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
.
.
.
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='mso-no-proof:yes'>20[true]</span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='font-size:6.0pt;mso-bidi-font-size:12.0pt'><o:p> </o:p></span></p>
<h2><span lang=EN-US>Element Notes</span></h2>
I need the values 01 = 01,02,...,20[true]
The variable name always appears in this format:
<span style='mso-no-proof:yes'>2716</span>
That is, a 4-digit number inside a span tag.
So one line of output could be 2716,01 = 01,02,...,20[true]
Answer 0 (score: 0)
This matches the required tags (span tags whose style attribute is set to mso-no-proof:yes); after that, it is just a matter of extracting the text.
from bs4 import BeautifulSoup
html = """<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='mso-no-proof:yes'>20[true]</span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='font-size:6.0pt;mso-bidi-font-size:12.0pt'><o:p> </o:p></span></p>
<h2><span lang=EN-US>Element Notes</span></h2>"""
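# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# Exact match on the style attribute: only spans whose style is precisely 'mso-no-proof:yes'
elements = soup.find_all(name='span', attrs={'style' : 'mso-no-proof:yes'})
# Python 3 print function; e.text is the text inside each matched span
print(','.join(e.text for e in elements))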
Output:
01 = 01,02,20[true]
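The snippet above only collects the values of a single block. To get one line per variable, as the question asks, the same idea can be applied while walking the whole document in order. The following is a minimal sketch, not a tested solution for the real file. It assumes that each 4-digit variable name appears before its own "Element Values" heading, that every values block ends at the next h2, and that no 4-digit span with this style appears outside a values block other than the variable names. The function name extract_records and the sample markup are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

def extract_records(html):
    """Return one 'varname,VAL1,VAL2,...' string per variable (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []        # one list per variable: [name, val1, val2, ...]
    current = None      # record currently being filled
    in_values = False   # True while we are below an "Element Values" heading
    # find_all returns tags in document order, so headings and spans interleave correctly
    for tag in soup.find_all(["h2", "span"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            in_values = (text == "Element Values")
            continue
        # Substring test, so spans with extra style declarations still match (assumption)
        if "mso-no-proof:yes" not in tag.get("style", ""):
            continue
        if in_values:
            if current is not None and text:
                current.append(text)
        elif re.fullmatch(r"\d{4}", text):
            # A 4-digit span outside a values block is taken to be a variable name
            current = [text]
            records.append(current)
    return [",".join(r) for r in records]

sample_html = """<p><span style='mso-no-proof:yes'>2716</span></p>
<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>01 = 01<o:p></o:p></span></p>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>02<o:p></o:p></span></p>
<p class=MsoListParagraph style='line-height:normal'><span lang=EN-US
style='mso-no-proof:yes'>20[true]</span></p>
<h2><span lang=EN-US>Element Notes</span></h2>
<p><span style='mso-no-proof:yes'>2717</span></p>
<h2><span lang=EN-US>Element Values</span></h2>
<p class=MsoListParagraph><span lang=EN-US style='mso-no-proof:yes'>A</span></p>
<h2><span lang=EN-US>Element Notes</span></h2>"""

for line in extract_records(sample_html):
    print(line)
```

If the real file places the variable name somewhere else relative to the values block, the loop would need to be adjusted accordingly. For a very large file it may also be worth checking whether a faster parser such as lxml is available.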