My dataset has the following structure:
<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>
As you can see, inside the <p> tag there are multiple tagged entities, such as <ORGANIZATION>Peter Hall Company</ORGANIZATION> and <PERSON>Penelope Keith</PERSON>.
Using jsoup, I want to list all the entities contained in the <p> tags.
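I assume jsoup should be able to handle this. I have seen some questions about specific cases, but I could not get them to work in mine. Could that be because <ORGANIZATION> and <PERSON> are not real HTML tags? Do I have to use regular expressions? If it can be done with jsoup, how?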
So far I have tried this:
for (Iterator<Element> iterator = contents.iterator(); iterator.hasNext();)
{
Element content = iterator.next();
String text = content.text();
String title = content.select("PERSON").text();
String output = text.replaceFirst(title, "").trim();
System.out.println(output);
}
and this:
for (Element content : contents)
{
String PERSON = content.attr("PERSON");
String linkText = content.text();
//print
System.out.println(PERSON);
System.out.println(linkText);
}
Neither of them works.
Answer 0 (score: 2)
You only need a CSS selector:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Foo {
public static void main(String... args) {
String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
Document doc = Jsoup.parse(xml);
for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
System.out.printf("-> %s: %s\n", e.tagName(), e.text());
}
}
}
Output:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
Edit: if you want to strip these tags but keep their content, you can replace each element with its text content while iterating, like this:
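import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Foo {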
public static void main(String... args) {
String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
Document doc = Jsoup.parse(xml);
for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
System.out.printf("-> %s: %s\n", e.tagName(), e.text());
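// Note: newer jsoup releases deprecate this two-argument constructor; new TextNode(e.text()) is the equivalent there.
e.replaceWith(new TextNode(e.text(), ""));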
}
System.out.println("\nFiltered out:\n" + doc.select("p").html());
}
}
Output:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
Filtered out:
The Peter Hall Company's production of ''Blithe Spirit,'' directed by Thea Sharrock, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for Penelope Keith's startlingly brisk and no-nonsense interpretation of the madcap medium Madame Arcati, Ms. Sharrock's take on Coward's 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.
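On the question of whether a regular expression is required: it is not, but for comparison the regex route can be sketched with the JDK alone. This is only an illustration under assumptions, not part of the jsoup solution above: the class name `RegexEntities`, its two helper methods, and the shortened sample string are hypothetical, and the pattern assumes the entity tags are never nested and carry no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEntities {
    // Matches <ORGANIZATION>...</ORGANIZATION> or <PERSON>...</PERSON>.
    // The backreference \1 forces the closing tag to match the opening one.
    private static final Pattern ENTITY =
            Pattern.compile("<(ORGANIZATION|PERSON)>(.*?)</\\1>");

    /** Returns one "TAG: text" entry per tagged entity, in document order. */
    public static List<String> extract(String markup) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(markup);
        while (m.find()) {
            found.add(m.group(1) + ": " + m.group(2));
        }
        return found;
    }

    /** Drops the entity tags but keeps the text they wrap. */
    public static String stripTags(String markup) {
        return ENTITY.matcher(markup).replaceAll("$2");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical variant of the sample paragraph.
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production, "
                + "directed by <PERSON>Thea Sharrock</PERSON>.</p>";
        for (String entity : extract(xml)) {
            System.out.println("-> " + entity);
        }
        System.out.println(stripTags(xml));
    }
}
```

Unlike the jsoup version, this keeps the tag names in their original upper case, but it breaks as soon as the markup gets more irregular (nested entities, attributes, line breaks inside an entity), which is the usual argument for preferring a parser.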
Answer 1 (score: 0)
This works, but it is not elegant:
//people
Elements contents_person = doc.getElementsByTag("p").select("PERSON");
for (Element content : contents_person)
{
//String PERSON = content.attr("PERSON");
String linkText = content.text();
//print
//System.out.println(PERSON);
System.out.println(linkText);
}
//places
Elements contents_place = doc.getElementsByTag("p").select("LOCATION");
for (Element content : contents_place)
{
//String PERSON = content.attr("PERSON");
String linkText = content.text();
//print
//System.out.println(PERSON);
System.out.println(linkText);
}
//things
Elements contents_things = doc.getElementsByTag("p").select("ORGANIZATION");
for (Element content : contents_things)
{
//String PERSON = content.attr("PERSON");
String linkText = content.text();
//print
//System.out.println(PERSON);
System.out.println(linkText);
}
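The three loops above differ only in the tag name, so they can be folded into a single pass that groups entity text by tag. The sketch below is one possible way to do that, and it deliberately swaps jsoup for the JDK's built-in DOM parser so that it runs without extra dependencies. It assumes the paragraph is well-formed XML (true for the sample in the question); the class name `GroupedEntities`, the `group` method, and the shortened sample string are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class GroupedEntities {
    /** Groups the text of every direct child element of the root by its tag name. */
    public static Map<String, List<String>> group(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Malformed input ends up here; a real program may want finer handling.
            throw new IllegalStateException("Could not parse paragraph as XML", e);
        }
        // LinkedHashMap keeps tags in the order they first appear.
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip the plain text between entities
            }
            byTag.computeIfAbsent(child.getNodeName(), k -> new ArrayList<>())
                 .add(child.getTextContent());
        }
        return byTag;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical variant of the sample paragraph.
        String xml = "<p><PERSON>Penelope Keith</PERSON> plays <ORGANIZATION>Madame Arcati</ORGANIZATION> "
                + "for <PERSON>Thea Sharrock</PERSON>.</p>";
        group(xml).forEach((tag, names) -> System.out.println(tag + ": " + names));
    }
}
```

Because the map is keyed by whatever tag names occur, a LOCATION entity needs no extra loop; it simply shows up as another key when the data contains one.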