I need to extract the row that contains a specific string, but my code below returns the HTML markup along with it.
from BeautifulSoup import BeautifulSoup
import re
import os
import codecs
import sys
get_company = "ABB LTD"
OUTFILE = os.path.join('company', 'a', 'viewids')
soup = BeautifulSoup(open("/company/a/searches/a"))
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr')
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))]
print userrows
This is the format of my table:
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
So how do I get just the CIK number for A J&J PHARMA CORP? Right now it gives me output like this:
[<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>]
Answer 0 (score: 2)
import re
from BeautifulSoup import BeautifulSoup
html= '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
'''
soup = BeautifulSoup(html)
table = soup.find("table", {"id":"cos"})
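# With text=..., find() returns the matching text node (a NavigableString), not the <td> tag.
text_node = table.find('td', text='A J&J PHARMA CORP')
# text_node.parent is the <td> and its parent is the <tr>; the second cell of that row holds the CIK number.
print(text_node.parent.parent.findAll('td')[1].string)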
This prints:
1140452
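For comparison, here is a rough sketch of the same lookup using only the Python 3 standard library (`html.parser`) instead of BeautifulSoup. This is not what the answer above does; it swaps in a different parser to show the underlying idea: collect the cell text of each row, then return the second cell of the row whose first cell matches. The names `TableRows` and `find_cik` are made up for illustration, and the sketch assumes the table is not nested inside another table with the same id.

```python
from html.parser import HTMLParser

HTML = """
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
"""


class TableRows(HTMLParser):
    """Hypothetical helper: collect cell text, row by row, for one table id."""

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id
        self.rows = []          # finished rows, each a list of cell strings
        self._in_table = False  # True while inside the table we care about
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self._in_table = True
        elif not self._in_table:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def find_cik(html, company, table_id="cos"):
    """Return the second cell of the row whose first cell equals company, or None."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if len(row) >= 2 and row[0] == company:
            return row[1]
    return None


print(find_cik(HTML, "A J&J PHARMA CORP"))  # prints 1140452
```

Because the match is an exact string comparison on the first cell, a company that is not in the table (for example "ABB LTD" with this sample) simply yields `None` rather than an `IndexError`.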