I have some HTML that looks like this:
<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>
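I need to extract the HTML from the comment up to the closing dl tag. That closing dl is the first one after the comment (I'm not sure whether more could follow later, but there has never been one before it). The HTML in between varies in length and content and has no good identifiers.
I see that I can select the comment itself using the #comment node, but how do I grab everything that starts at the comment and ends at a closing HTML tag?
Here is what I came up with. It works, but it is obviously not the most efficient.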
String myDirectoryPath = "D:\\Path";
File dir = new File(myDirectoryPath);
String charSet = "UTF-8";
// Lazily capture everything between the comment and the first closing dl tag
Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
for (File child : dir.listFiles()) {
    System.out.println(child.getAbsolutePath());
    String innerHtml = Jsoup.parse(child, charSet).select("body").html();
    Matcher m = p.matcher(innerHtml);
    if (m.find()) {
        // Re-parse the captured fragment to strip the tags and keep only the text
        String myText = Jsoup.parse(m.group(1)).text();
        // try-with-resources closes the writer even if println fails
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
            out.println(myText);
        } catch (IOException e) {
            // error
        }
    }
}
Answer 0 (score: 2)
With a regular expression, perhaps simply:
# "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"
<!-- \s* start \s* content \s* -->
([\S\s]*?)
</ \s* dl \s* >
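As a minimal sketch of how the pattern above could be applied in Java, using only java.util.regex: the class name CommentToDlExtractor and the method extract are made up for illustration. Because the quantifier is lazy, the match ends at the first </dl> after the comment, so a dl nested inside another dl would cut the result short.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CommentToDlExtractor {
    // Same pattern as in the answer: lazily capture everything between
    // the marker comment and the first closing dl tag.
    private static final Pattern BLOCK = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    /** Returns the HTML from just after the comment through the closing dl tag, if present. */
    static Optional<String> extract(String html) {
        Matcher m = BLOCK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // group(1) stops before </dl>; slicing up to m.end() keeps the closing tag too.
        return Optional.of(html.substring(m.start(1), m.end()));
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html).orElse("(no match)"));
    }
}
```

If only the inner HTML without the closing tag is wanted, m.group(1) can be returned instead of the substring.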
Answer 1 (score: 2)
Here is some sample code. It may need further refinement, depending on what you want to do.
final String html = "<p>abc</p>" // Additional tag before the comment
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment
// Since it's not a full HTML document (head / body), you can use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for( Node node : doc.childNodes() ) // Iterate over the top-level nodes of the document
{
    if( node.nodeName().equals("#comment") ) // If it's a comment, we do something
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // 'toString().trim()' only tidies the output
        System.out.println("=== Childs ========");
        // Get the siblings of the comment (the nodes before and after it)
        final List<Node> childNodes = node.siblingNodes();
        // Start and end index for the sublist - used to skip the tags before the comment node
        // siblingNodes() excludes the comment itself, so its own index points at the node after (!) it
        final int startIdx = node.siblingIndex();
        final int endIdx = childNodes.size(); // End index - the last following node
        // Iterate over all nodes that follow the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example, only Elements (HTML tags) are used
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;
                /*
                 * Do something with your elements / nodes here ...
                 *
                 * You can skip e.g. the 'p' tag by checking tag names.
                 */
                System.out.println(element);
                // Stop after processing the 'dl' element (= closing 'dl' tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}
The code is deliberately verbose to make it easier to follow; you can shorten it in places.
Finally, here is the output of this example:
=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
<dd>
blah
</dd>
</dl>
=== END ===========
By the way, to get the text of the comment, just cast the node to Comment:
String commentText = ((Comment) node).getData();
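As a shorter variant of the loop in this answer, the nodes after the comment can also be walked with nextSibling() instead of slicing the sibling list. This is only a sketch under a few assumptions: jsoup is on the classpath, the marker comment sits at the top level of the parsed fragment (as in the sample above), and the names SiblingWalk and htmlAfterMarker are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

// Hypothetical helper, not part of jsoup.
final class SiblingWalk {
    /**
     * Collects the outer HTML of every node that follows the marker comment,
     * up to and including the first dl element. Returns "" if the marker is absent.
     */
    static String htmlAfterMarker(String html, String marker) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // Assumption: like the answer's code, only top-level nodes are scanned.
        for (Node node : doc.childNodes()) {
            if (!(node instanceof Comment)
                    || !((Comment) node).getData().trim().equals(marker)) {
                continue;
            }
            StringBuilder out = new StringBuilder();
            for (Node next = node.nextSibling(); next != null; next = next.nextSibling()) {
                out.append(next.outerHtml());
                // Stop once the dl element has been appended
                if (next instanceof Element && ((Element) next).tagName().equals("dl")) {
                    break;
                }
            }
            return out.toString();
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(htmlAfterMarker(html, "start content"));
    }
}
```

Unlike the element-only loop above, this version also keeps text nodes between the tags, which may or may not be what you want. The exact whitespace of the result depends on the output settings of the jsoup version in use.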