Extract HTML from a <!-- --> comment up to a closing tag with jsoup (Java)

Date: 2013-11-11 22:53:00

Tags: java html regex jsoup

I have some HTML that looks like this:

<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>

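I need to extract the HTML from the comment up to the closing dl tag. The closing dl is the first one after the comment (I'm not sure whether more will follow it, but there has never been one before it). The HTML in between varies in length and content and has no good identifiers.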

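I see that the comment itself can be selected through the #comment node, but how do I get the HTML that starts at the comment and ends at a closing HTML tag?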

Here is what I came up with. It works, but it is clearly not the most efficient approach.

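    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String myDirectoryPath = "D:\\Path";
    File dir = new File(myDirectoryPath);
    Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
    for (File child : dir.listFiles()) {
        System.out.println(child.getAbsolutePath());
        String charSet = "UTF-8";
        // Parse the file and take the body's inner HTML as a string for the regex
        String innerHtml = Jsoup.parse(child, charSet).select("body").html();
        Matcher m = p.matcher(innerHtml);
        if (m.find()) {
            // Re-parse the captured fragment so jsoup can strip the tags
            Document doc = Jsoup.parse(m.group(1));
            String myText = doc.text();
            // Append to the combined file; try-with-resources closes the writer
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
                out.println(myText);
            } catch (IOException e) {
                // error
            }
        }
    }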

2 answers:

Answer 0 (score: 2)

With a regular expression, it could be as simple as this:

 #  "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"

 <!-- \s* start \s* content \s* -->
 ([\S\s]*?) 
 </ \s* dl \s* >
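
To see what that pattern captures, here is a minimal, self-contained sketch. The class name `StartContentRegex` and the helper `extract` are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, only for demonstrating the pattern above.
class StartContentRegex {
    // Same pattern as in the answer. The lazy ([\S\s]*?) stops at the first </dl>.
    private static final Pattern BLOCK = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    // Returns group(1): everything between the comment and the first closing dl tag.
    static Optional<String> extract(String html) {
        Matcher m = BLOCK.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html).orElse("(no match)"));
    }
}
```

Note that group(1) contains neither the comment nor the closing `</dl>` itself. If the closing tag is needed as well, `m.group(0)` would give the whole match including the comment.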

Answer 1 (score: 2)

Here is some sample code. It may need further refinement, depending on your goal.

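import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

final String html = "<p>abc</p>" // Additional tag before the comment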
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment

// Since this is a fragment rather than a full HTML document (head / body), you can use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());


for( Node node : doc.childNodes() ) // Iterate over the top-level nodes of the document
{
    if( node.nodeName().equals("#comment") ) // If it's a comment, process the nodes that follow it
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // 'toString().trim()' only tidies the output
        System.out.println("=== Childs ========");


        // Get the siblings of the comment; in current jsoup this list does not include the comment itself
        final List<Node> childNodes = node.siblingNodes();

        // Start and end index for the sublist - used to skip the nodes before the comment
        final int startIdx = node.siblingIndex();   // Start index - start after (!) the comment node
        final int endIdx = childNodes.size();       // End index - the last following node

        // Iterate over all nodes, following after the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example, they are only used as Element's (Html Tags)
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;

                /*
                 * Do something with your elements / nodes here ...
                 * 
                 * You can skip e.g. 'p'-tag by checking tagnames.
                 */
                System.out.println(element);

                // Stop after processing 'dl'-tag (= closing 'dl'-tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}

The code is deliberately verbose to make it easier to follow; you can shorten it in places.

Finally, here is the output of this example:

=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
 <dd>
  blah
 </dd>
</dl>
=== END ===========

By the way, to get the text of the comment, just cast the node to Comment:

String commentText = ((Comment) node).getData();
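
Putting the pieces together, a shorter variant of the sibling walk might look like the sketch below. It is not the answer's exact code: it uses `nextSibling()` instead of a sublist, matches the comment by its text "start content" (an assumption about your input), and returns the plain text like the question's code does. The names `CommentToDl` and `textAfterComment` are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

// Hypothetical helper class, only for illustration.
class CommentToDl {
    // Collects the text of every element from the "start content" comment
    // up to and including the first dl that follows it, one element per line.
    static String textAfterComment(String html) {
        // XML parser, as in the answer: the fragment's nodes become direct children of the document
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        StringBuilder text = new StringBuilder();
        for (Node node : doc.childNodes()) {
            // Assumption: the marker comment contains exactly "start content"
            if (!(node instanceof Comment)
                    || !((Comment) node).getData().trim().equals("start content")) {
                continue;
            }
            // Walk the following siblings; text nodes (whitespace) are skipped
            for (Node next = node.nextSibling(); next != null; next = next.nextSibling()) {
                if (next instanceof Element) {
                    Element element = (Element) next;
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(element.text());
                    if (element.tagName().equals("dl")) {
                        break; // Stop after the first dl
                    }
                }
            }
            break; // Only the first matching comment is handled
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(textAfterComment(html));
    }
}
```

Like the answer's loop, this only looks at top-level nodes, so a comment nested inside another element would not be found. If you need the markup rather than the text, appending `element.outerHtml()` instead of `element.text()` should work the same way.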