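How do I convert an HTML table with colspan and rowspan into a 2D array (matrix) in Java?
I found good solutions in Python and jQuery, but none in Java (only ones for very simple tables using jsoup). There is a very nice XSLT solution, but it does not suit me because my input HTML files are malformed.
Example input table: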
<body>
<table border="1">
<tr><td>H1</td><td colspan="2">H2</td><tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>
<tr><td rowspan="3">A1</td><td>B1</td><td rowspan="2">C1</td></tr>
<tr><td rowspan="2">B2</td></tr>
<tr><td>C3</td></tr>
<tr><td>C4</td><td>C5</td><td>C6</td></tr>
<tr><td>D7</td><td colspan="2">D9</td></tr>
<tr><td colspan="3">Notes</td></tr>
</table>
</body>
Expected output:
[['H1', 'H2', 'H2'],
['', 'SubH2_1', 'SubH2_2'],
['A1', 'B1', 'C1'],
['A1', 'B2', 'C3'],
['C4', 'C5', 'C6'],
['D7', 'D9', 'D9'],
['Notes', 'Notes', 'Notes']]
Answer 0 (score: 2)
I found a way to do it using Jsoup and the Java 8 Stream API:
//given:
final InputStream html = getClass().getClassLoader().getResourceAsStream("table.html");
//when:
final Document document = Jsoup.parse(html, "UTF-8", "/");
final List<List<String>> result = document.select("table tr")
        .stream()
        // Select all <td> tags in single row
        .map(tr -> tr.select("td"))
        // Repeat n-times those <td> that have `colspan="n"` attribute
        .map(rows -> rows.stream()
                .map(td -> Collections.nCopies(td.hasAttr("colspan") ? Integer.valueOf(td.attr("colspan")) : 1, td))
                .flatMap(Collection::stream)
                .collect(Collectors.toList())
        )
        // Fold final structure to 2D List<List<Element>>
        .reduce(new ArrayList<List<Element>>(), (acc, row) -> {
            // First iteration - just add current row to a final structure
            if (acc.isEmpty()) {
                acc.add(row);
                return acc;
            }
            // If last array in 2D array does not contain element with `rowspan` - append current
            // row and skip to next iteration step
            final List<Element> last = acc.get(acc.size() - 1);
            if (last.stream().noneMatch(td -> td.hasAttr("rowspan"))) {
                acc.add(row);
                return acc;
            }
            // In this case last array in 2D array contains an element with `rowspan` - we are going to
            // add this element n-times to current rows where n == rowspan - 1
            final AtomicInteger index = new AtomicInteger(0);
            last.stream()
                    // Map to a helper list of (index in array, rowspan value or 0 if not present, Jsoup element)
                    .map(td -> Arrays.asList(index.getAndIncrement(), Integer.valueOf(td.hasAttr("rowspan") ? td.attr("rowspan") : "0"), td))
                    // Filter out all elements without rowspan
                    .filter(it -> ((int) it.get(1)) > 1)
                    // Add all elements with rowspan to current row at the index they are present
                    // (add them with `rowspan="n-1"`)
                    .forEach(it -> {
                        final int idx = (int) it.get(0);
                        final int rowspan = (int) it.get(1);
                        final Element td = (Element) it.get(2);
                        row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));
                    });
            acc.add(row);
            return acc;
        }, (a, b) -> a)
        .stream()
        // Extract inner HTML text from Jsoup elements in 2D array
        .map(tr -> tr.stream()
                .map(Element::text)
                .collect(Collectors.toList())
        )
        .collect(Collectors.toList());
I added plenty of comments to explain what happens at each step of the algorithm.
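The colspan step can be looked at in isolation. The sketch below is only an illustration and does not use Jsoup: the `Cell` class and the `expandColspan` method are hypothetical stand-ins for a `<td>` element and for the `Collections.nCopies` / `flatMap` stage of the pipeline.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class ColspanDemo {
    // Hypothetical stand-in for a <td>: its text plus its colspan (1 when the attribute is absent).
    static final class Cell {
        final String text;
        final int colspan;

        Cell(String text, int colspan) {
            this.text = text;
            this.colspan = colspan;
        }
    }

    // Repeat each cell's text `colspan` times and flatten the copies into one row.
    static List<String> expandColspan(List<Cell> row) {
        return row.stream()
                .flatMap(c -> Collections.nCopies(c.colspan, c.text).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Cell> header = Arrays.asList(new Cell("H1", 1), new Cell("H2", 2));
        System.out.println(expandColspan(header)); // prints [H1, H2, H2]
    }
}
```

With Jsoup, the colspan value would come from `td.attr("colspan")`, as in the pipeline above.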
For the Jsoup example I used the following HTML file:
<body>
<table border="1">
<tr><td>H1</td><td colspan="2">H2</td></tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td></tr>
<tr><td rowspan="2">A1</td><td>B1</td><td>C1</td></tr>
<tr><td>B2</td><td>C3</td></tr>
<tr><td>C4</td><td>C5</td><td>C6</td></tr>
<tr><td>D7</td><td colspan="2">D9</td></tr>
<tr><td colspan="3">Notes</td></tr>
</table>
</body>
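It is the same as yours; the only difference is that the rowspan usage is fixed: in your example A1 is repeated three times instead of two. The two <tr> elements are also closed properly in this example, otherwise two extra empty arrays would show up in the final structure.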
Here is the console output:
[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
You can also run this example with the exact HTML you pasted in the question; it produces somewhat different output:
[H1, H2, H2]
[]
[, SubH2_1, SubH2_2]
[]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
These empty arrays show up because there are two unclosed <tr> elements in the HTML:
<tr><td>H1</td><td colspan="2">H2</td><tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>
Closing them and running the algorithm again produces the following output:
[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
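As you can see, A1 appears 3 times because it has the rowspan="3" attribute, B2 has rowspan="2", and C1 has rowspan="2" as well. The browser renders this HTML "almost" the same as my first example, but if you look closely at those 3 rows you will see that they are not at the same pixel level. To match your expected output, I fixed the input HTML so that it looks and behaves the way you expect.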
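To make the rowspan behaviour concrete, here is a minimal sketch of a different technique from the stream pipeline above: filling an occupancy grid row by row. It is not the code from my test, and the `Cell` class, the `td` helpers, `toMatrix` and the explicit `columns` parameter are all hypothetical names introduced for illustration. It uses plain Java instead of Jsoup so that it can run on its own.

```java
import java.util.Arrays;
import java.util.List;

class SpanGrid {
    // Hypothetical stand-in for a <td>: text, colspan and rowspan (both 1 when absent).
    static final class Cell {
        final String text;
        final int colspan;
        final int rowspan;

        Cell(String text, int colspan, int rowspan) {
            this.text = text;
            this.colspan = colspan;
            this.rowspan = rowspan;
        }
    }

    static Cell td(String text) { return new Cell(text, 1, 1); }
    static Cell td(String text, int colspan, int rowspan) { return new Cell(text, colspan, rowspan); }

    // Expands spans into a rectangular grid; slots that no cell covers end up as "".
    static String[][] toMatrix(List<List<Cell>> rows, int columns) {
        String[][] grid = new String[rows.size()][columns];
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already claimed by a rowspan from an earlier row.
                while (c < columns && grid[r][c] != null) c++;
                // Write the text into every slot the cell spans, clipped to the grid.
                for (int dr = 0; dr < cell.rowspan && r + dr < rows.size(); dr++) {
                    for (int dc = 0; dc < cell.colspan && c + dc < columns; dc++) {
                        grid[r + dr][c + dc] = cell.text;
                    }
                }
                c += cell.colspan;
            }
        }
        // Replace uncovered slots with empty strings.
        for (String[] row : grid) {
            for (int c = 0; c < columns; c++) {
                if (row[c] == null) row[c] = "";
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // The three rows from the question that use rowspan.
        List<List<Cell>> rows = Arrays.asList(
                Arrays.asList(td("A1", 1, 3), td("B1"), td("C1", 1, 2)),
                Arrays.asList(td("B2", 1, 2)),
                Arrays.asList(td("C3")));
        // prints [[A1, B1, C1], [A1, B2, C1], [A1, B2, C3]]
        System.out.println(Arrays.deepToString(toMatrix(rows, 3)));
    }
}
```

For the three rowspan rows of the question's HTML this gives the same rows as the output above: A1 three times, B2 twice and C1 twice.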
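Well, if you cannot modify the input HTML, you will have to deal with the unclosed <tr> tags and change your expected output for A1, B2 and C3: the rendered HTML view does not show the exact structure of this table as it is written in HTML. Here you can find the full source code of the JUnit test I used to work out the answer to this question. Feel free to download this sample Maven project hosted on GitHub to experiment with the implementation of the algorithm.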
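One possible way to cope with the unclosed <tr> tags without touching the HTML is to drop the empty rows after parsing. This is only a sketch under that assumption; `EmptyRowFilter` and `dropEmptyRows` are hypothetical names, and the input list simply mirrors the first rows of the output shown earlier.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class EmptyRowFilter {
    // Removes rows without any cells, such as those produced by a stray <tr>.
    static List<List<String>> dropEmptyRows(List<List<String>> rows) {
        return rows.stream()
                .filter(row -> !row.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> parsed = Arrays.asList(
                Arrays.asList("H1", "H2", "H2"),
                Collections.<String>emptyList(),
                Arrays.asList("", "SubH2_1", "SubH2_2"),
                Collections.<String>emptyList());
        System.out.println(dropEmptyRows(parsed)); // prints [[H1, H2, H2], [, SubH2_1, SubH2_2]]
    }
}
```

In the Jsoup pipeline the same effect could probably be achieved earlier, for example with `.filter(tr -> !tr.select("td").isEmpty())` right after `document.select("table tr").stream()`, but I have not run that variant against your file.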
I hope it helps.