Question

I am trying to use jTidy to extract data from (real-world) HTML. However, jTidy does not parse custom tags.

<html>
  <body>
    <myCustomTag>some text</myCustomTag>
    <anotherCustom>more text</anotherCustom>
  </body>
</html>

I cannot get the text between the custom tags. I have to use jTidy because I am going to use XPath.

I tried HTMLCleaner, but it does not support the full set of XPath functions.

Answer 1

You can also set the options with a Java Properties object, for example:

import java.util.Properties;
import org.w3c.tidy.Tidy;

// Declare the extra tags as block-level elements so JTidy accepts them.
Properties oProps = new Properties();
oProps.setProperty("new-blocklevel-tags", "header hgroup article footer nav");

Tidy tidy = new Tidy();
tidy.setConfigurationFromProps(oProps);

This saves you from having to create and load a configuration file.
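If the tag names are only known at runtime, the property value can be assembled instead of hard-coded. The following is a minimal sketch using only the JDK; the class name TidyTagProps and the helper blockLevelTags are hypothetical names chosen for illustration, and the resulting Properties object would be handed to tidy.setConfigurationFromProps in the same way as above.

```java
import java.util.List;
import java.util.Properties;

public class TidyTagProps {
    // Hypothetical helper: joins the tag names into the space-separated
    // value used for the new-blocklevel-tags option.
    static Properties blockLevelTags(List<String> tagNames) {
        Properties props = new Properties();
        props.setProperty("new-blocklevel-tags", String.join(" ", tagNames));
        return props;
    }

    public static void main(String[] args) {
        Properties props = blockLevelTags(List.of("myCustomTag", "anotherCustom"));
        // Prints the value that would be passed on to JTidy.
        System.out.println(props.getProperty("new-blocklevel-tags"));
    }
}
```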

Answer 2

See http://tidy.sourceforge.net/docs/quickref.html#new-blocklevel-tags

The quick and dirty way is to create a file, which I named jTidyTags, and call:

// Load Tidy options from the file named jTidyTags.
Tidy tidy = new Tidy();
tidy.setConfigurationFromFile("jTidyTags");

After that it will emit a warning saying the document is not W3C compliant, but who cares. This will let you parse the file.

An example of jTidyTags is:

new-blocklevel-tags: myCustomTag anotherCustom
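Once the custom tags are accepted, the text inside them can be read with the JDK's XPath API, since JTidy's parseDOM returns an org.w3c.dom.Document. The sketch below is an illustration under assumptions: to stay runnable without JTidy on the classpath it builds the Document with the JDK parser as a stand-in, and the class CustomTagXPath and its methods are hypothetical names. With JTidy the Document would come from tidy.parseDOM instead, and the element names in its output may be cased differently, so the XPath expressions might need adjusting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class CustomTagXPath {
    // Evaluates an XPath expression against the DOM and returns its string value.
    static String textOf(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    // Stand-in for JTidy: parses already well-formed markup with the JDK parser.
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body>"
                + "<myCustomTag>some text</myCustomTag>"
                + "<anotherCustom>more text</anotherCustom>"
                + "</body></html>");
        System.out.println(textOf(doc, "//myCustomTag"));
        System.out.println(textOf(doc, "//anotherCustom"));
    }
}
```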

Hope this helps!

How do I add new tags to JTidy?
