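Extracting table data from HTML using Python and BeautifulSoup

Time: 2014-02-27 17:14:07

Tags: python html beautifulsoup

I'm new to Python and the BeautifulSoup library. I've tried a lot of things, but with no luck.

My HTML code looks something like this: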

<form method = "post" id="FORM1" name="FORM1">
<table cellpadding=0 cellspacing=1 border=0 align="center" bgcolor="#cccccc">
   <tr>

    <td class="producto"><b>Club</b><br>

       <input value="CLUB TENIS DE MESA PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtClub" size="60" maxlength="55">
     </td> 
   <tr>
    <td colspan="2" class="producto"><b>Nombre Equipo</b><br>
    <input value="C.T.M. PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtNomEqu" size="100" maxlength="80">
    </td>
   </tr>
   <tr>
     <td class="producto"><b>Telefono fijo</b><br>
       <input value="63097005534" disabled class="txtmascaraform" type="TEXT" name="txtTelf" size="15" maxlength="10">
     </td>

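I only need to get the content inside each <b></b> tag, together with the value of the corresponding input.

Thanks a lot!!

1 answer:

Answer 0 (score: 0)

First find() your form by its ID, then find_all() the inputs inside it and read each one's value attribute: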

from bs4 import BeautifulSoup


data = """<form method = "post" id="FORM1" name="FORM1">
<table cellpadding=0 cellspacing=1 border=0 align="center" bgcolor="#cccccc">
   <tr>

    <td class="producto"><b>Club</b><br>

       <input value="CLUB TENIS DE MESA PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtClub" size="60" maxlength="55">
     </td>
   <tr>
    <td colspan="2" class="producto"><b>Nombre Equipo</b><br>
    <input value="C.T.M. PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtNomEqu" size="100" maxlength="80">
    </td>
   </tr>
   <tr>
     <td class="producto"><b>Telefono fijo</b><br>
       <input value="63097005534" disabled class="txtmascaraform" type="TEXT" name="txtTelf" size="15" maxlength="10">
     </td>
   </tr>
</table>
</form>"""

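# Name the parser explicitly so the result is the same on every machine
soup = BeautifulSoup(data, "html.parser")
form = soup.find("form", {'id': "FORM1"})
# value attribute of every <input> inside the form
print([item.get('value') for item in form.find_all('input')])

# UPDATE for getting table cell values
table = form.find("table")
# Stripped text of every <td>, which here is the <b> label
print([item.text.strip() for item in table.find_all('td')])

Prints (Python 3):

['CLUB TENIS DE MESA PORTOBAIL', 'C.T.M. PORTOBAIL', '63097005534']
['Club', 'Nombre Equipo', 'Telefono fijo']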
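
Since the question asks for each <b> label together with its input value, the two lists can also be built as pairs in a single pass. The following is only a sketch under a few assumptions: the markup is a shortened version of the form from the question, every cell of interest carries class "producto", each such cell holds exactly one <b> and one <input>, and the labels are unique. The names `html` and `pairs` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the form from the question (assumption)
html = """<form method="post" id="FORM1" name="FORM1">
<table>
  <tr>
    <td class="producto"><b>Club</b><br>
      <input value="CLUB TENIS DE MESA PORTOBAIL" disabled type="TEXT" name="txtClub">
    </td>
  </tr>
  <tr>
    <td colspan="2" class="producto"><b>Nombre Equipo</b><br>
      <input value="C.T.M. PORTOBAIL" disabled type="TEXT" name="txtNomEqu">
    </td>
  </tr>
  <tr>
    <td class="producto"><b>Telefono fijo</b><br>
      <input value="63097005534" disabled type="TEXT" name="txtTelf">
    </td>
  </tr>
</table>
</form>"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find("form", {"id": "FORM1"})

pairs = {}
for cell in form.find_all("td", class_="producto"):
    label = cell.find("b")      # the bold caption of the cell
    field = cell.find("input")  # the input that sits in the same cell
    if label is None or field is None:
        continue                # skip cells that lack either part
    pairs[label.get_text(strip=True)] = field.get("value")

for name, value in pairs.items():
    print(name, "->", value)
```

Looking up both tags inside the same <td> keeps each label attached to its own input, so the pairing does not rely on the two separate lists happening to line up by position.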