首先,我读过Parsing a table with rowspan and colspan。我甚至回答了这个问题。请在将此标记为重复之前阅读。
<table border="1">
<tr>
<th>A</th>
<th>B</th>
</tr>
<tr>
<td rowspan="2">C</td>
<td rowspan="1">D</td>
</tr>
<tr>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
</tr>
</table>
它将呈现为
+---+---+---+
| A | B | |
+---+---+ |
| | D | |
+ C +---+---+
| | E | F |
+---+---+---+
| G | H | |
+---+---+---+
<table border="1">
<tr>
<th>A</th>
<th>B</th>
</tr>
<tr>
<td rowspan="2">C</td>
<td rowspan="2">D</td>
</tr>
<tr>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
</tr>
</table>
然而,这会像这样呈现。
+---+---+-------+
| A | B | |
+---+---+-------+
| | | |
| C | D +---+---+
| | | E | F |
+---+---+---+---+
| G | H | |
+---+---+---+---+
我之前回答的代码只能解析包含第一行中定义的所有列的表。
def table_to_2d(table_tag):
rows = table_tag("tr")
cols = rows[0](["td", "th"])
table = [[None] * len(cols) for _ in range(len(rows))]
for row_i, row in enumerate(rows):
for col_i, col in enumerate(row(["td", "th"])):
insert(table, row_i, col_i, col)
return table
def insert(table, row, col, element):
if row >= len(table) or col >= len(table[row]):
return
if table[row][col] is None:
value = element.get_text()
table[row][col] = value
if element.has_attr("colspan"):
span = int(element["colspan"])
for i in range(1, span):
table[row][col+i] = value
if element.has_attr("rowspan"):
span = int(element["rowspan"])
for i in range(1, span):
table[row+i][col] = value
else:
insert(table, row, col + 1, element)
soup = BeautifulSoup('''
<table>
<tr><th>1</th><th>2</th><th>5</th></tr>
<tr><td rowspan="2">3</td><td colspan="2">4</td></tr>
<tr><td>6</td><td>7</td></tr>
</table>''', 'html.parser')
print(table_to_2d(soup.table))
我的问题是如何将表解析为2D数组,表示完全它在浏览器中的呈现方式。或者有人可以解释浏览器如何呈现表格也没问题。
答案 0 :(得分:12)
您不能只计算td
或th
个单元格,不能。您必须在表中扫描以获取每行的列数,并将前一行中任何活动的行扫描添加到该计数。
在different scenario parsing a table with rowspans我跟踪每列数的行数,以确保来自不同单元格的数据最终在正确的列中。这里可以使用类似的技术。
首先计数列;只保留最高的数字。保留行数跨度为2或更大的列表,并为每个列处理的每行减去1。这样你就知道每行有多少'额外'列。获取最高列数以构建输出矩阵。
接下来,再次循环遍历行和单元格,这次跟踪字典中从列号到活动计数的行间距。再次,对任何值为2或更高的任何事物都要进行下一行。然后移动列号以考虑任何活动的行扫描;如果第0列上的行数有效,则行中的第一个td
实际上是第二个
您的代码会重复将跨区列和行的值复制到输出中;我通过在给定单元格的colspan
和rowspan
个数字上创建一个循环(每个默认为1)来多次复制该值,从而实现了相同的目标。我忽略了重叠的细胞; HTML table specifications状态重叠单元格是一个错误,由用户代理解决冲突。在下面的代码中,colspan胜过了rowpan单元格。
from itertools import product
def table_to_2d(table_tag):
rowspans = [] # track pending rowspans
rows = table_tag.find_all('tr')
# first scan, see how many columns we need
colcount = 0
for r, row in enumerate(rows):
cells = row.find_all(['td', 'th'], recursive=False)
# count columns (including spanned).
# add active rowspans from preceding rows
# we *ignore* the colspan value on the last cell, to prevent
# creating 'phantom' columns with no actual cells, only extended
# colspans. This is achieved by hardcoding the last cell width as 1.
# a colspan of 0 means “fill until the end” but can really only apply
# to the last cell; ignore it elsewhere.
colcount = max(
colcount,
sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
# update rowspan bookkeeping; 0 is a span to the bottom.
rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
rowspans = [s - 1 for s in rowspans if s > 1]
# it doesn't matter if there are still rowspan numbers 'active'; no extra
# rows to show in the table means the larger than 1 rowspan numbers in the
# last table row are ignored.
# build an empty matrix for all possible cells
table = [[None] * colcount for row in rows]
# fill matrix from row data
rowspans = {} # track pending rowspans, column number mapping to count
for row, row_elem in enumerate(rows):
span_offset = 0 # how many columns are skipped due to row and colspans
for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
# adjust for preceding row and colspans
col += span_offset
while rowspans.get(col, 0):
span_offset += 1
col += 1
# fill table data
rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
colspan = int(cell.get('colspan', 1)) or colcount - col
# next column is offset by the colspan
span_offset += colspan - 1
value = cell.get_text()
for drow, dcol in product(range(rowspan), range(colspan)):
try:
table[row + drow][col + dcol] = value
rowspans[col + dcol] = rowspan
except IndexError:
# rowspan or colspan outside the confines of the table
pass
# update rowspan bookkeeping
rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}
return table
这会正确解析您的样本表:
>>> from pprint import pprint
>>> pprint(table_to_2d(soup.table), width=30)
[['1', '2', '5'],
['3', '4', '4'],
['3', '6', '7']]
并处理你的其他例子;第一张表:
>>> table1 = BeautifulSoup('''
... <table border="1">
... <tr>
... <th>A</th>
... <th>B</th>
... </tr>
... <tr>
... <td rowspan="2">C</td>
... <td rowspan="1">D</td>
... </tr>
... <tr>
... <td>E</td>
... <td>F</td>
... </tr>
... <tr>
... <td>G</td>
... <td>H</td>
... </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(table1.table), width=30)
[['A', 'B', None],
['C', 'D', None],
['C', 'E', 'F'],
['G', 'H', None]]
第二个:
>>> table2 = BeautifulSoup('''
... <table border="1">
... <tr>
... <th>A</th>
... <th>B</th>
... </tr>
... <tr>
... <td rowspan="2">C</td>
... <td rowspan="2">D</td>
... </tr>
... <tr>
... <td>E</td>
... <td>F</td>
... </tr>
... <tr>
... <td>G</td>
... <td>H</td>
... </tr>
... </table>
... ''', 'html.parser')
>>> pprint(table_to_2d(table2.table), width=30)
[['A', 'B', None, None],
['C', 'D', None, None],
['C', 'D', 'E', 'F'],
['G', 'H', None, None]]
最后但并非最不重要的是,代码正确处理超出实际表的跨度,以及"0"
跨度(延伸到末尾),如下例所示:
<table border="1">
<tr>
<td rowspan="3">A</td>
<td rowspan="0">B</td>
<td>C</td>
<td colspan="2">D</td>
</tr>
<tr>
<td colspan="0">E</td>
</tr>
</table>
有两行4个单元格,即使你认为rowpan和colspan值可能有3和5:
+---+---+---+---+
| | | C | D |
| A | B +---+---+
| | | E |
+---+---+-------+
这样的过度处理就像浏览器一样处理;它们被忽略,0跨度扩展到剩余的行或列:
>>> span_demo = BeautifulSoup('''
... <table border="1">
... <tr>
... <td rowspan="3">A</td>
... <td rowspan="0">B</td>
... <td>C</td>
... <td colspan="2">D</td>
... </tr>
... <tr>
... <td colspan="0">E</td>
... </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(span_demo.table), width=30)
[['A', 'B', 'C', 'D'],
['A', 'B', 'E', 'E']]
答案 1 :(得分:2)
重要的是,Martijn Pieters 解决方案无法解决单元格同时具有rowpan和colspan属性的情况。 例如
<table border="1">
<tr>
<td rowspan="3" colspan="3">A</td>
<td>B</td>
<td>C</td>
<td>D</td>
</tr>
<tr>
<td colspan="3">E</td>
</tr>
<tr>
<td colspan="1">E</td>
<td>C</td>
<td>C</td>
</tr>
<tr>
<td colspan="1">E</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
</tr>
</table>
此表呈现为
+-----------+---+---+---+
| A | B | C | D |
| +---+---+---+
| | E |
| +---+---+---+
| | E | C | C |
+---+---+---+---+---+---+
| E | C | C | C | C | C |
+---+---+---+---+---+---+
但是如果我们应用该函数,我们将得到
[['A', 'A', 'A', 'B', 'C', 'D'],
['A', 'E', 'E', 'E', None, None],
['A', 'E', 'C', 'C', None, None],
['E', 'C', 'C', 'C', 'C', 'C']]
可能存在一些极端情况,但是将行跨簿记扩展到行跨和列跨product
中的单元格,即
for drow, dcol in product(range(rowspan), range(colspan)):
try:
table[row + drow][col + dcol] = value
rowspans[col + dcol] = rowspan
except IndexError:
# rowspan or colspan outside the confines of the table
pass
似乎可以处理该线程中的示例,对于上面的表,它将输出
[['A', 'A', 'A', 'B', 'C', 'D'],
['A', 'A', 'A', 'E', 'E', 'E'],
['A', 'A', 'A', 'E', 'C', 'C'],
['E', 'C', 'C', 'C', 'C', 'C']]
答案 2 :(得分:0)
使用通常的遍历方法只是将解析器类型更改为lxml。
soup = BeautifulSoup(resp.text, "lxml")
现在采用常规的解析方式。