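I'm having a hard time cleaning up some HTML code to get a few specific href links, along with the text content inside the table's td tags, such as dates and text.
This is the web page link. You have to click DFP to reach this page.
I only want the information that follows the text DFP - ENET - ATIVO.
This is the HTML code: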
html_source = """
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table align="center" border="0" cellpadding="0" cellspacing="0" width="640">
<tbody>
<tr>
<td align="right" colspan="3"><img border="0" src="images/titulos_ciaslist_info_sobre_empr_IPEV.gif"><br>
<br>
<br>
<br></td>
</tr>
<tr>
<td colspan="3"><font class="TextoEx"><b>Código CVM : 001023<br>
Razão Social : BANCO DO BRASIL S.A.<br>
CNPJ : 00.000.000/0001-91<br>
<br>
<br>
<br>
<br></b></font></td>
</tr>
<tr class="LegendaPequenaC">
<td bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%">9 documento(s) encontrado(s)</td>
<td align="center" bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%">Exibindo 1 a 9</td>
<td align="right" bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%"></td>
</tr>
<tr valign="top">
<td colspan="3">
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Ativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('57534','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('57534','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2015</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>02/06/2016 11:44</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>3.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('57534','CONSULTA')"><u>001023DFP311220150300057534-67</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('54536','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('54536','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2015</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>28/03/2016 22:09</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>2.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('54536','CONSULTA')"><u>001023DFP311220150200054536-63</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('53614','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('53614','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2015</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>25/02/2016 08:29</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Apresentação</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>1.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('53614','CONSULTA')"><u>001023DFP311220150100053614-77</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Ativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('45354','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('45354','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2014</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>27/03/2015 08:18</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>2.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('45354','CONSULTA')"><u>001023DFP311220140200045354-67</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('43994','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('43994','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2014</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>11/02/2015 08:24</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Apresentação</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>1.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('43994','CONSULTA')"><u>001023DFP311220140100043994-74</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Ativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('41430','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('41430','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2013</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>25/09/2014 18:24</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>4.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('41430','CONSULTA')"><u>001023DFP311220130400041430-77</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('35587','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('35587','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2013</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>27/03/2014 09:55</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>3.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('35587','CONSULTA')"><u>001023DFP311220130300035587-73</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('34667','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('34667','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2013</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>19/02/2014 17:47</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Reapresentação Espontânea</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>2.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('34667','CONSULTA')"><u>001023DFP311220130200034667-63</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%"></table>
<table align="center" bgcolor="#BEBEBE" border="0" cellpadding="0" cellspacing="1" width="95%">
<tbody>
<tr class="TableOptions">
<td bgcolor="#F7F7F7" width="20%"><b>Categoria</b></td>
<td bgcolor="#FFFFFF" colspan="2" width="50%">DFP - ENET - Inativo</td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('34513','CONSULTA')" style="COLOR : 'olivedrab'">Consulta</a></b></td>
<td align="center" bgcolor="#F7F7F7" class="LegendaPequenaC" width="15%"><b><a href="javascript:fVisualizaArquivo_ENET('34513','DOWNLOAD')" style="COLOR : 'olivedrab'">Download</a></b></td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Data Encerramento</b></td>
<td bgcolor="#FFFFFF">31/12/2013</td>
<td bgcolor="#F7F7F7" width="15%"><b>Data Entrega</b></td>
<td bgcolor="#FFFFFF" colspan="2" nowrap>13/02/2014 08:54</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Tipo Apresentação</b></td>
<td bgcolor="#FFFFFF">Apresentação</td>
<td bgcolor="#F7F7F7" width="15%"><b>Versão</b></td>
<td bgcolor="#FFFFFF" colspan="3" nowrap>1.0</td>
</tr>
<tr class="TableOptions">
<td bgcolor="#F7F7F7"><b>Prot. de entrega</b></td>
<td bgcolor="#FFFFFF" colspan="4">
<a href="javascript:fVisualizaProtocolo_ENET('34513','CONSULTA')"><u>001023DFP311220130100034513-71</u></a>
</td>
</tr>
</tbody>
</table><br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr class="LegendaPequenaC">
<td bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%">9 documento(s) encontrado(s)</td>
<td align="center" bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%">Exibindo 1 a 9</td>
<td align="right" bgcolor="#F7F7F7" style="COLOR : 'olivedrab'" width="33%"></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
</body>
</html>
"""
This is my code:
from bs4 import BeautifulSoup
#insert html_source here
soup = BeautifulSoup(html_source, 'html.parser')
table = soup.find('table')
tds = table.find_all('td', {'colspan':'2'})
for td in tds:
    if td.text == 'DFP - ENET - Ativo':
        print(td.find_next('href'))
When I try print(td.next_sibling()) instead, I get the following TypeError message:
TypeError: 'NavigableString' object is not callable
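I have read this question and this one, but could not get my code to work.
If possible, I would like the output for this particular HTML page (which contains 3 active items) in the following format: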
[("javascript:fVisualizaArquivo_ENET('57534','CONSULTA')", "31/12/2015", "02/06/2016 11:44", "Reapresentação Espontânea", "3.0"), ("javascript:fVisualizaArquivo_ENET('45354','CONSULTA')", "31/12/2014", "27/03/2015 08:18", "Reapresentação Espontânea", "2.0"), ("javascript:fVisualizaArquivo_ENET('41430','CONSULTA')", "31/12/2013", "25/09/2014 18:24", "Reapresentação Espontânea", "4.0")]
Answer 0 (score: 1)
from bs4 import BeautifulSoup
#insert html_source here
soup = BeautifulSoup(html_source, 'html.parser')
links = [a['href'] for a in soup('a', text='Download')]
Encerramento = [i.find_next('td').text for i in soup('b', text='Data Encerramento')]
Entrega = [i.find_next('td').text for i in soup('b', text='Data Entrega')]
Tipo = [i.find_next('td').text for i in soup('b', text='Tipo Apresentação')]
Versão = [i.find_next('td').text for i in soup('b', text='Versão')]
for i in zip(links, Encerramento, Entrega, Tipo, Versão):
    print(i)
Output:
("javascript:fVisualizaArquivo_ENET('57534','DOWNLOAD')", '31/12/2015', '02/06/2016 11:44', 'Reapresentação Espontânea', '3.0')
("javascript:fVisualizaArquivo_ENET('54536','DOWNLOAD')", '31/12/2015', '28/03/2016 22:09', 'Reapresentação Espontânea', '2.0')
("javascript:fVisualizaArquivo_ENET('53614','DOWNLOAD')", '31/12/2015', '25/02/2016 08:29', 'Apresentação', '1.0')
("javascript:fVisualizaArquivo_ENET('45354','DOWNLOAD')", '31/12/2014', '27/03/2015 08:18', 'Reapresentação Espontânea', '2.0')
("javascript:fVisualizaArquivo_ENET('43994','DOWNLOAD')", '31/12/2014', '11/02/2015 08:24', 'Apresentação', '1.0')
("javascript:fVisualizaArquivo_ENET('41430','DOWNLOAD')", '31/12/2013', '25/09/2014 18:24', 'Reapresentação Espontânea', '4.0')
("javascript:fVisualizaArquivo_ENET('35587','DOWNLOAD')", '31/12/2013', '27/03/2014 09:55', 'Reapresentação Espontânea', '3.0')
("javascript:fVisualizaArquivo_ENET('34667','DOWNLOAD')", '31/12/2013', '19/02/2014 17:47', 'Reapresentação Espontânea', '2.0')
("javascript:fVisualizaArquivo_ENET('34513','DOWNLOAD')", '31/12/2013', '13/02/2014 08:54', 'Apresentação', '1.0')
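Use the label text as an anchor, then find the next td tag after it.
That gives five lists; use zip to combine them into one tuple per record.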
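The lists above cover all nine records and use the Download links, while the question asks only for the DFP - ENET - Ativo records with their Consulta links. One way to narrow it down, as a rough sketch rather than a definitive solution, is to start from each matching category cell, climb to its enclosing record table with find_parent, and search only inside that table. The names active_records, value_after, LABELS and sample_html are made up for this illustration, and the sample is a trimmed-down stand-in for the page's structure. Two side notes on the original attempt: next_sibling is an attribute, not a method, so adding parentheses tries to call the NavigableString it returns; and find_next('href') looks for a tag named href, whereas href is an attribute of the a tag. The string= argument used here is the newer spelling of text= (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

# Labels whose following <td> holds the value we want, in output order.
LABELS = ("Data Encerramento", "Data Entrega", "Tipo Apresentação", "Versão")

# Trimmed-down sample with one active and one inactive record (illustrative only).
sample_html = """
<table><tr><td>
<table>
<tr><td><b>Categoria</b></td><td colspan="2">DFP - ENET - Ativo</td>
<td><b><a href="javascript:fVisualizaArquivo_ENET('57534','CONSULTA')">Consulta</a></b></td>
<td><b><a href="javascript:fVisualizaArquivo_ENET('57534','DOWNLOAD')">Download</a></b></td></tr>
<tr><td><b>Data Encerramento</b></td><td>31/12/2015</td>
<td><b>Data Entrega</b></td><td>02/06/2016 11:44</td></tr>
<tr><td><b>Tipo Apresentação</b></td><td>Reapresentação Espontânea</td>
<td><b>Versão</b></td><td>3.0</td></tr>
</table>
<table>
<tr><td><b>Categoria</b></td><td colspan="2">DFP - ENET - Inativo</td>
<td><b><a href="javascript:fVisualizaArquivo_ENET('54536','CONSULTA')">Consulta</a></b></td>
<td><b><a href="javascript:fVisualizaArquivo_ENET('54536','DOWNLOAD')">Download</a></b></td></tr>
<tr><td><b>Data Encerramento</b></td><td>31/12/2015</td>
<td><b>Data Entrega</b></td><td>28/03/2016 22:09</td></tr>
<tr><td><b>Tipo Apresentação</b></td><td>Reapresentação Espontânea</td>
<td><b>Versão</b></td><td>2.0</td></tr>
</table>
</td></tr></table>
"""


def value_after(table, label):
    """Return the text of the first <td> that follows the <b> holding `label`."""
    label_tag = table.find("b", string=label)
    return label_tag.find_next("td").get_text(strip=True)


def active_records(html, category="DFP - ENET - Ativo"):
    """Return one tuple per record table whose category cell equals `category`."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for cell in soup.find_all("td"):
        # Only the category cell itself has exactly this text.
        if cell.get_text(strip=True) != category:
            continue
        # The nearest enclosing <table> is the record this cell belongs to.
        table = cell.find_parent("table")
        link = table.find("a", string="Consulta")
        records.append(
            (link["href"], *(value_after(table, label) for label in LABELS))
        )
    return records


print(active_records(sample_html))
```

Calling active_records(html_source) on the full page from the question should give the three tuples for documents 57534, 45354 and 41430, in the format requested. Scoping every lookup to one record table also avoids relying on the five separate lists staying aligned.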