Question

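I want to extract one specific table from an HTML document that contains several tables, but unfortunately the tables have no identifiers. Each one does have a table heading, though. I can't seem to figure it out.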

Here is a sample HTML file:

<BODY>
<TABLE>
<TH>
<H3>    <BR>TABLE 1    </H3>
</TH>
<TR>
<TD>Data 1    </TD>
<TD>Data 2    </TD>
</TR>
<TR>
<TD>Data 3    </TD>
<TD>Data 4    </TD>
</TR>
<TR>
<TD>Data 5    </TD>
<TD>Data 6    </TD>
</TR>
</TABLE>

<TABLE>
<TH>
<H3>    <BR>TABLE 2    </H3>
</TH>
<TR>
<TD>Data 7    </TD>
<TD>Data 8    </TD>
</TR>
<TR>
<TD>Data 9    </TD>
<TD>Data 10    </TD>
</TR>
<TR>
<TD>Data 11    </TD>
<TD>Data 12    </TD>
</TR>
</TABLE>
</BODY>

I can use BeautifulSoup 4 to get a table by its id or name, but I need a table that can only be identified by its position.

I know I can get the first table:

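from bs4 import BeautifulSoup

tmp = f.read()  # f is an already-open file object for the HTML document
soup = BeautifulSoup(tmp, 'html.parser')  # parse the markup
table = soup.find('table')  # gets the first table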

But how can I get the second table?

Answer 1

You can rely on the table heading.

Find the element by its text by passing a function as the value of the text argument, then get its parent table:

table_name = "TABLE 1"
table = soup.find(text=lambda x: x and table_name in x).find_parent('table')
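As a minimal, self-contained sketch of the heading-based lookup, the snippet below runs it against a shortened copy of the sample document. The helper name find_table_by_heading is hypothetical, the 'html.parser' backend is an assumption, and string= is used because it is the newer name for the text argument in Beautiful Soup 4.4.0 and later. The match is a substring test, so "TABLE 1" would also match a heading such as "TABLE 10".

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document from the question.
html = """
<BODY>
<TABLE>
<TH><H3>    <BR>TABLE 1    </H3></TH>
<TR><TD>Data 1    </TD><TD>Data 2    </TD></TR>
</TABLE>
<TABLE>
<TH><H3>    <BR>TABLE 2    </H3></TH>
<TR><TD>Data 7    </TD><TD>Data 8    </TD></TR>
</TABLE>
</BODY>
"""


def find_table_by_heading(soup, table_name):
    """Return the <table> enclosing the first text node that contains
    table_name, or None if no such text exists (hypothetical helper)."""
    heading = soup.find(string=lambda s: s and table_name in s)
    return heading.find_parent('table') if heading else None


soup = BeautifulSoup(html, 'html.parser')
table = find_table_by_heading(soup, "TABLE 2")

# Collect the cell text, stripping the trailing padding from each cell.
cells = [td.get_text(strip=True) for td in table.find_all('td')]
print(cells)  # ['Data 7', 'Data 8']
```

Returning None when the heading is missing avoids the AttributeError that the one-line version raises when soup.find() finds nothing.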

Answer 2

If it can only be identified by position, meaning it is always the second table on the page, you can do this:

tmp = f.read()
soup = BeautifulSoup(tmp)

# this will return the second table from the website
all_tables = soup.find_all('table')
second_table = all_tables[1]
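The positional approach can be sketched end to end as follows, including a bounds check and the extraction of the cell text. The helper name table_rows_at and the inline sample markup are assumptions for illustration, as is the 'html.parser' backend. Note that find_all('table') returns tables in document order and also includes nested tables, so the index counts those as well.

```python
from bs4 import BeautifulSoup

# Small stand-in document with two tables (illustrative only).
html = """
<TABLE><TR><TD>Data 1</TD><TD>Data 2</TD></TR></TABLE>
<TABLE><TR><TD>Data 7</TD><TD>Data 8</TD></TR></TABLE>
"""


def table_rows_at(soup, index):
    """Return the rows of the table at a zero-based position,
    each row as a list of cell strings (hypothetical helper)."""
    tables = soup.find_all('table')
    if index >= len(tables):
        raise IndexError(f"document has only {len(tables)} table(s)")
    return [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in tables[index].find_all('tr')
    ]


soup = BeautifulSoup(html, 'html.parser')
rows = table_rows_at(soup, 1)  # index 1 is the second table
print(rows)  # [['Data 7', 'Data 8']]
```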

Extracting a table from HTML by position using Python
