I am trying to extract the text from a string of HTML-style tags with content.
For example:
<CalaisSimpleOutputFormat>
<Country count="13" relevance="0.771" normalized="China">China</Country>
<Country count="4" relevance="0.598">Taiwan</Country>
<City count="3" relevance="0.491" normalized="Beijing,China">Beijing</City>
<NaturalFeature count="3" relevance="0.415">Yellow river</NaturalFeature>
<Organization count="2" relevance="0.491">Communist Party</Organization>
<Region count="2" relevance="0.258">Central Asia</Region>
<Region count="2" relevance="0.315">East Asia</Region>
<City count="1" relevance="0.304" normalized="Shanghai,China">Shanghai</City>
<City count="1" relevance="0.304" normalized="Chongqing,China">Chongqing</City>
<City count="1" relevance="0.101" normalized="Taipei,Taiwan">Taipei</City>
<City count="1" relevance="0.304" normalized="Tianjin,China">Tianjin</City>
<Continent count="1" relevance="0.053">Asia</Continent>
<Country count="1" relevance="0.101" normalized="Japan">Japan</Country>
<Country count="1" relevance="0.304" normalized="Macau">Macau</Country>
<MedicalCondition count="1" relevance="0.160">hereditary monarchies</MedicalCondition>
<NaturalFeature count="1" relevance="0.254">Himalaya</NaturalFeature>
<NaturalFeature count="1" relevance="0.274">Gobi desert</NaturalFeature>
<NaturalFeature count="1" relevance="0.208">Yellow sea</NaturalFeature>
<NaturalFeature count="1" relevance="0.208">Pacific Ocean</NaturalFeature>
<NaturalFeature count="1" relevance="0.291">Great Lakes</NaturalFeature>
<NaturalFeature count="1" relevance="0.231">Yangtze river</NaturalFeature>
<NaturalFeature count="1" relevance="0.274">Taklamakan desert</NaturalFeature>
<NaturalFeature count="1" relevance="0.208">South China sea</NaturalFeature>
<NaturalFeature count="1" relevance="0.231">Tibetan Plateau</NaturalFeature>
<NaturalFeature count="1" relevance="0.208">Bohai sea</NaturalFeature>
<NaturalFeature count="1" relevance="0.208">East sea</NaturalFeature>
<NaturalFeature count="1" relevance="0.254">Tian Shan mountain ranges</NaturalFeature>
<Organization count="1" relevance="0.062">G-20</Organization>
<Organization count="1" relevance="0.073">U.N. Security Council</Organization>
<Organization count="1" relevance="0.062">APEC</Organization>
<Organization count="1" relevance="0.062">BRICS</Organization>
<Organization count="1" relevance="0.062">BCIM</Organization>
<Organization count="1" relevance="0.073">United Nations</Organization>
<Organization count="1" relevance="0.062">Shanghai Cooperation Organisation</Organization>
<Organization count="1" relevance="0.062">World Trade Organization</Organization>
<Organization count="1" relevance="0.105">ROC government</Organization>
<Position count="1" relevance="0.073">permanent member</Position>
<Region count="1" relevance="0.208">East China</Region>
<Region count="1" relevance="0.208">South China</Region>
<Region count="1" relevance="0.254">South Asia</Region>
<Region count="1" relevance="0.184">North China</Region>
<Topics>
<Topic Taxonomy="Calais" Score="0.558">Politics</Topic>
<Topic Taxonomy="Calais" Score="0.534">War_Conflict</Topic>
</Topics>
</CalaisSimpleOutputFormat>
My code successfully extracts the text from those tags, but the output is:
ChinaChongqingShanghaiTaipeiTianjin................
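I would like to know whether there is a way to extract the text items one by one, or to separate them with spaces, so that I can store them in a list. For example: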
China
Chongqing
Shanghai
Taipei
......
I have tried the following code:
Document doc = Jsoup.parse(html);
for (Element a : doc.select("CalaisSimpleOutputFormat")) {
    System.out.println(a.text());
}
and
for (Node child : XX.childNodes()) {
    if (child instanceof TextNode) {
        System.out.println(((TextNode) child).text());
    }
}
and
Document doc = Jsoup.parse(html);
Element start = doc.select("CalaisSimpleOutputFormat").first();
String text = start.text();
None of these work... any suggestions?
Answer 0 (score: 1)
This program stores the data you need in an ArrayList object:
package com.loknath.lab;

/*
 * @Author Loknath
 */
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        Test test = new Test();
        String file = "OCtest.txt";
        try {
            list = test.entityExtractionByFile(file);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(list);
    }

    public List<String> entityExtractionByFile(String fileLocation) throws IOException {
        List<String> list = new ArrayList<>();
        // The original answer read the file with a FileToString helper that is not shown;
        // Files.readAllBytes is used here as a stand-in. Change it to whatever you use
        // to read the file in as a string.
        String content = new String(Files.readAllBytes(Paths.get(fileLocation)), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(content);
        Element element = doc.select("CalaisSimpleOutputFormat").first();
        // ownText() returns only the text directly inside each child element,
        // so every entity becomes a separate list entry instead of one joined string.
        for (Element elem : element.children()) {
            list.add(elem.ownText());
            System.out.println(elem.ownText());
        }
        return list;
    }
}
Output:
China
Taiwan
Beijing
.
.
.
.
For the complete source code [click here...]
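As a side note, the sample in the question is well-formed XML, so the same per-element extraction could probably also be done with the DOM parser that ships with the JDK, without jsoup. The following is only a minimal sketch under that assumption. The class name EntityTextExtractor and the method extractEntityTexts are made up for illustration and are not part of the original answer. It collects the text of every leaf element below the root, so a wrapper such as <Topics> is skipped while its <Topic> children are included.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for illustration only.
final class EntityTextExtractor {

    // Returns the trimmed text of every leaf element below the root, in document order.
    static List<String> extractEntityTexts(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // "*" matches every descendant element of the root.
            NodeList all = doc.getDocumentElement().getElementsByTagName("*");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                // Skip wrappers such as <Topics>; keep only elements that hold text directly.
                if (!hasChildElement(element)) {
                    texts.add(element.getTextContent().trim());
                }
            }
            return texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: the input is well-formed XML; anything else is reported to the caller.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    // True if the element contains at least one child element.
    private static boolean hasChildElement(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened version of the sample from the question.
        String xml = "<CalaisSimpleOutputFormat>\n"
                + "<Country count=\"13\" relevance=\"0.771\" normalized=\"China\">China</Country>\n"
                + "<Country count=\"4\" relevance=\"0.598\">Taiwan</Country>\n"
                + "<Topics>\n"
                + "<Topic Taxonomy=\"Calais\" Score=\"0.558\">Politics</Topic>\n"
                + "</Topics>\n"
                + "</CalaisSimpleOutputFormat>";
        for (String text : extractEntityTexts(xml)) {
            System.out.println(text);
        }
    }
}
```

With the shortened sample in main, this prints China, Taiwan and Politics on separate lines. If the real input turns out not to be strictly well-formed XML, the jsoup approach above is likely the more forgiving choice.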