Question

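I'm reading information from an RSS feed that stores HTML code in its description tag, so the content is not plain text. I need to extract some information from it, such as the first image that appears, but I can't, because none of the tags inside description are parsed by Jsoup. I think this is the behaviour of a CDATA element.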

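I put "automatic way" in quotes in my question because other questions posted here suggest using .replace() to strip the CDATA markers. That doesn't seem like a valid solution to me, since I think it only works for specific cases rather than as a general approach. So my question is: is there a way to make Jsoup do the parsing without my replacing text? Is replacing the only way? Should I use another library?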

For example, when I parse the RSS document, the description node contains:

&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;
a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;
&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, seg&uacute;n lo visto en un enigm&aacu

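All the special characters such as "<" and ">" are escaped because the content is CDATA. The rest of the document parses fine; this only happens inside the CDATA content.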

The code I use to access it:

doc = Jsoup.connect("http://www.3djuegos.com/universo/rss/rss.php?plats=1-2-3-4-5-6-7-34&tipos=noticia-analisis-avance-video-imagenes-demo&fotos=peques&limit=20").get();
System.out.println(doc.html()); // Shows the document well parsed.

Elements nodes = doc.getElementsByTag("item"); // Access to news
for(int i = 0; i < nodes.size(); i++){ // Loop all news

    // Description node
    Element descriptionNode = nodes.get(i).getElementsByTag("description").get(0);

    // Print the node's content; the HTML tags inside it come out escaped
    System.out.println(nodes.get(i).getElementsByTag("description").html());

    // Try to access the first image. This fails because the description text is
    // escaped, so Jsoup cannot parse it into nodes
    Element imageNode = descriptionNode.getElementsByTag("img").get(0);
}

Edit: I'm using doc.outputSettings().escapeMode(EscapeMode.xhtml), but I don't think it affects the CDATA content.

Edit 2: As a workaround I'm using the org.apache.commons.lang3.StringEscapeUtils library, which lets me unescape the HTML, but I'm still wondering whether Jsoup already handles this case.

Answer 1

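You can use the text() method to get the unescaped value. That is, if the element's value is &lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;, then element.text() returns <table width='100%' border='0' cellspacing='0' cellpadding='4'>. You can then parse that fragment again to get whatever you need. For example: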

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Sample {
    public static void main(String[] args) throws Exception {
        String html = "<description>"
                        + "&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;"
                        + "a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;"
                        + "&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, seg&uacute;n lo visto en un enigm&aacu"
                    + "</description>";

        Document doc = Jsoup.parse(html);
        for(Element desc : doc.select("description")){
            String unescapedHtml = desc.text();
            String src = Jsoup.parse(unescapedHtml).select("img").first().attr("src");
            System.out.println(src);
        }
        System.out.println("Done");
    }

}
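
On the "should I use another library?" part of the question, here is a minimal sketch using only the JDK's built-in DOM parser. It is meant to show why reading the text of description is enough: entity-escaped markup and a CDATA section are two encodings of the same character data, so the XML layer returns the same raw HTML string for both. The class name DescriptionText, the helper firstDescription, and the a.jpg sample feed fragments are made up for illustration and are not taken from the real feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DescriptionText {
    // Returns the decoded character data of the first <description> element.
    // Entity references are resolved and CDATA markers are dropped by the parser.
    static String firstDescription(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("description").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical feed fragments: the same HTML, once escaped and once in CDATA.
        String escaped = "<item><description>&lt;img src='a.jpg' /&gt;</description></item>";
        String cdata = "<item><description><![CDATA[<img src='a.jpg' />]]></description></item>";

        System.out.println(firstDescription(escaped)); // <img src='a.jpg' />
        System.out.println(firstDescription(cdata));   // <img src='a.jpg' />
    }
}
```

Either way the result is an ordinary HTML string, which can then be passed to Jsoup.parse(...) as in the answer above to select the img element. No replace() on the CDATA markers is needed.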

Automatic way to parse tags inside CDATA with Jsoup, without replace
