Extracting tagged entities from a <p> element

Time: 2015-04-18 04:57:58

Tags: java jsoup

My dataset has the following structure:

<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>

As you can see, inside the <p> tag there are several tagged entities, such as <ORGANIZATION>Peter Hall Company</ORGANIZATION> and <PERSON>Penelope Keith</PERSON>.

Using jsoup, I want to list all the entities contained in the <p> tags.

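I think jsoup should be able to handle this. I have seen some questions about specific cases, but I could not get their solutions to work for mine. Could that be because <ORGANIZATION> and <PERSON> are not real HTML tags? Do I have to use regular expressions? If I can do it with jsoup, how?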

So far I have tried this:

    for (Iterator<Element> iterator = contents.iterator(); iterator.hasNext();)
    {
        Element content = iterator.next();
        String text = content.text();
        String title = content.select("PERSON").text();
        String output = text.replaceFirst(title, "").trim();
        System.out.println(output);
    }

And this:

    for (Element content : contents) 
    {
        String PERSON = content.attr("PERSON");
        String linkText = content.text();

        //print
        System.out.println(PERSON);
        System.out.println(linkText);
    }

Neither of them works.

2 answers:

Answer 0 (score: 2)

You just need to use a CSS selector:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);

        for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
        }
    }
}

Output:

-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward

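Edit: if you want to filter out these tags and keep their content, you can replace each element with its text content while iterating over them, like this: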

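import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Foo {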
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);

        for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
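            // TextNode(text, baseUri) is the constructor from jsoup 1.8-era releases;
            // newer releases may only accept TextNode(text).
            e.replaceWith(new TextNode(e.text(), ""));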
        }

        System.out.println("\nFiltered out:\n" + doc.select("p").html());
    }
}

Output:

-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward

Filtered out:
The Peter Hall Company's production of ''Blithe Spirit,'' directed by Thea Sharrock, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for Penelope Keith's startlingly brisk and no-nonsense interpretation of the madcap medium Madame Arcati, Ms. Sharrock's take on Coward's 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.
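
As an alternative for the filtering step, jsoup's `Elements.unwrap()` removes the matched elements while leaving their children in place, which avoids building `TextNode` objects by hand. The following is a minimal sketch rather than part of the original answer; the class name `UnwrapEntities`, the method `stripEntityTags`, and the shortened sample sentence are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UnwrapEntities {
    // Hypothetical helper: drops the entity tags but keeps the text they wrap.
    static String stripEntityTags(String html) {
        Document doc = Jsoup.parse(html);
        // unwrap() removes each matched element and moves its children up into the parent <p>.
        doc.select("p > ORGANIZATION, p > PERSON").unwrap();
        return doc.body().html();
    }

    public static void main(String... args) {
        String html = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production, "
                + "directed by <PERSON>Thea Sharrock</PERSON>.</p>";
        System.out.println(stripEntityTags(html));
    }
}
```

Because the elements are removed from the document itself, any entity listing should be done before calling `unwrap()`.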

Answer 1 (score: 0)

This works, but it is not elegant:

            // people
            Elements persons = doc.getElementsByTag("p").select("PERSON");

            for (Element person : persons)
            {
                System.out.println(person.text());
            }

            // places
            Elements places = doc.getElementsByTag("p").select("LOCATION");

            for (Element place : places)
            {
                System.out.println(place.text());
            }

            // organizations
            Elements organizations = doc.getElementsByTag("p").select("ORGANIZATION");

            for (Element organization : organizations)
            {
                System.out.println(organization.text());
            }
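
The three near-identical loops above could be folded into a single selector whose results are grouped by entity type. This is a sketch under assumptions, not the answerer's code: the class `EntityLister`, the method `entitiesByType`, and the shortened sample text are invented for illustration, and it relies on jsoup's HTML parser matching tag names case-insensitively, as the selectors in this thread already do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class EntityLister {
    // Hypothetical helper: maps each entity type (upper-cased tag name)
    // to the texts of its elements, in document order.
    static Map<String, List<String>> entitiesByType(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> byType = new LinkedHashMap<>();
        // One selector covers all three entity tags found anywhere inside a <p>.
        for (Element e : doc.select("p PERSON, p LOCATION, p ORGANIZATION")) {
            String type = e.tagName().toUpperCase(Locale.ROOT);
            byType.computeIfAbsent(type, k -> new ArrayList<>()).add(e.text());
        }
        return byType;
    }

    public static void main(String... args) {
        String html = "<p><ORGANIZATION>Peter Hall Company</ORGANIZATION> staged a play directed by "
                + "<PERSON>Thea Sharrock</PERSON> with <PERSON>Penelope Keith</PERSON>.</p>";
        entitiesByType(html).forEach((type, names) -> System.out.println(type + ": " + names));
    }
}
```

Entity types that never occur in the input simply have no key in the returned map, so callers may want `getOrDefault` when looking one up.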