Extract a row from an HTML file without the HTML markup

Asked: 2013-12-14 06:24:03

Tags: python html beautifulsoup

I need to extract a table row that contains a specific string, but my code below returns the HTML markup along with the text.

from BeautifulSoup import BeautifulSoup
import re
import os
import codecs
import sys


get_company = "ABB LTD"


OUTFILE = os.path.join('company', 'a', 'viewids')

soup = BeautifulSoup(open("/company/a/searches/a"))
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr')
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))]
print userrows

This is the format of my table:

<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
  <tr>
    <th>Company Name</th>
    <th>CIK Number</th>
    <th>SIC Code</th>
  </tr>
  <tr valign="top">
    <td>A CONSULTING TEAM INC</td>
    <td align="right">1040792</td>
    <td align="right">7380</td>
  </tr>
  <tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
  </tr>
</table>

So how do I get just the CIK number for A J&J PHARMA CORP? Right now the code gives me the whole row, markup included:

[<tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
  </tr>]

1 answer:

Answer 0 (score: 2)

import re
from BeautifulSoup import BeautifulSoup

html= '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
  <tr>
    <th>Company Name</th>
    <th>CIK Number</th>
    <th>SIC Code</th>
  </tr>
  <tr valign="top">
    <td>A CONSULTING TEAM INC</td>
    <td align="right">1040792</td>
    <td align="right">7380</td>
  </tr>
  <tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
  </tr>
</table>
'''

soup = BeautifulSoup(html)
table = soup.find("table", {"id":"cos"})
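td = table.find('td', text='A J&amp;J PHARMA CORP')
# ^ With text=..., this returns the matching text node, not the <td> tag.
# Climb text node -> <td> -> <tr>, then take the second cell (the CIK number).
print(td.parent.parent.findAll('td')[1].string)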

This prints:

1140452
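
For comparison, the same lookup can be sketched without BeautifulSoup, using only `html.parser` from the Python 3 standard library. This is a different technique from the answer above, not a drop-in replacement. The names `TableRows` and `find_cik` are made up for this illustration, and the sketch assumes the target table contains no nested tables. Because the parser decodes entities, the company name is matched as `A J&J PHARMA CORP` rather than `A J&amp;J PHARMA CORP`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the table with a given id.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)  # decodes &amp; to &
        self.table_id = table_id
        self.in_table = False
        self.cell = None   # text pieces of the cell being read, or None
        self.rows = []     # finished rows, each a list of cell strings

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th") and self.rows:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Store plain text only, so no markup reaches the result.
            self.rows[-1].append("".join(self.cell).strip())
            self.cell = None
        elif tag == "table":
            self.in_table = False


def find_cik(html, company, table_id="cos"):
    """Return the second cell of the row whose first cell equals company."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if len(row) > 1 and row[0] == company:
            return row[1]
    return None


html = '''
<table id="cos" border="1">
  <tr><th>Company Name</th><th>CIK Number</th><th>SIC Code</th></tr>
  <tr valign="top">
    <td>A CONSULTING TEAM INC</td>
    <td align="right">1040792</td>
    <td align="right">7380</td>
  </tr>
  <tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
  </tr>
</table>
'''

print(find_cik(html, "A J&J PHARMA CORP"))  # prints 1140452
```

Comparing the first cell for exact equality, instead of searching the row with a regular expression, also avoids accidental matches when one company name is a substring of another.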