I need to extract some nodes from an XML file that is formatted like this:
<collection sentiment="negativo">
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>blabla</text>
<lang>english</lang>
</comment>
Now suppose the same XML file contains other <comment> elements that have <lang>spanish</lang>.
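I need to create two separate XML files. The first should contain all the nodes that have the child <lang>english</lang> (let's call it eng.xml), and the second all the nodes that have <lang>spanish</lang> (let's call it spa.xml).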
This is my Java code:
public void getEnglishRows() throws IOException {
    OutputStreamWriter f = new OutputStreamWriter(new FileOutputStream("C:/eclipse/neg_eng.xml"));
    BufferedWriter buff;
    NodeList current_row = doc.getElementsByTagName("comment"); // Puts all the row nodes (which in turn contain elements) into a list
    NodeList tmp;
    Node nodo = null;
    buff = new BufferedWriter(f);
    for (int i = 0; i < current_row.getLength(); i++) {
        tmp = current_row.item(i).getChildNodes();
        for (int k = 0; k < tmp.getLength(); k++) {
            nodo = tmp.item(k);
            if ("english".equals(nodo.getTextContent()))
                System.out.println("IF ENGLISH");
            buff.write(current_row.item(i).getNodeValue());
        }
    }
    buff.close();
}
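I don't know if I am being clear; I hope so.
So I have an XML file with many <comment></comment> elements. From all of these <comment></comment> elements I want to extract the ones with <lang>english</lang> and write each node (with its children) to another XML file. The same goes for <lang>spanish</lang>.
The expected output for eng.xml is: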
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>blabla</text>
<lang>english</lang>
</comment>
The expected output for spa.xml is:
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>blabla</text>
<lang>spanish</lang>
</comment>
I hope that is clear. My problem is that I can extract the text of all the nodes, but the XML tags are not preserved!
Please help me!
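Answer 0 (score: 0)
Why not try removing the comments that are not in English? My suggestion is to search for the lang tags and detect the non-English ones, then go to the parent element (the comment that contains that lang node) and remove it. This way the original file structure is preserved.
Try this code. It worked for me :)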
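import java.io.*;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;
import org.xml.sax.SAXException;

public void getEnglishRows() throws IOException, SAXException, ParserConfigurationException, TransformerException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc;
    try (InputStream in = new FileInputStream("C:/eclipse/neg_eng.xml")) {
        doc = db.parse(in);
    }
    NodeList current_row = doc.getElementsByTagName("lang"); // search for the lang elements
    // The NodeList is live, so collect the comments to delete first and remove them afterwards;
    // removing while iterating forward would shift the indices and skip elements.
    List<Node> toRemove = new ArrayList<>();
    for (int i = 0; i < current_row.getLength(); i++) {
        String lang = current_row.item(i).getTextContent().trim();
        if (!lang.equalsIgnoreCase("english")) {
            toRemove.add(current_row.item(i).getParentNode()); // the enclosing comment
        }
    }
    for (Node comment : toRemove) {
        // delete the non-English comment from whichever element contains it
        if (comment.getParentNode() != null) {
            comment.getParentNode().removeChild(comment);
        }
    }
    doc.normalize();
    // write the content into the xml file; an OutputStream lets the transformer pick the encoding
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    try (OutputStream out = new FileOutputStream("./eng_sent.xml")) {
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }
}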
The output file (eng_sent.xml in the code above) will look like this:
<collection sentiment="negativo">
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng3</text>
<lang>english</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng1</text>
<lang>english</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng2</text>
<lang>english</lang>
</comment>
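A note on why the code in the question lost the tags: for an element node, getNodeValue() returns null and getTextContent() returns only the concatenated text of its descendants, so neither gives you markup. The Transformer used above can serialize a single node as well as a whole document. Here is a minimal sketch of that, where the class NodeToXml and the helpers parse and toXml are names I made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class NodeToXml {
    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Hypothetical helper: serializes one node, markup included, to a String.
    static String toXml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the node itself is written.
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<collection><comment><text>blabla</text>"
                + "<lang>english</lang></comment></collection>");
        Node comment = doc.getElementsByTagName("comment").item(0);
        System.out.println(comment.getNodeValue());   // null for an element node
        System.out.println(comment.getTextContent()); // text only, no tags
        System.out.println(toXml(comment));           // the element with its tags
    }
}
```

If you only need the matching comments as text fragments, writing toXml(comment) to your BufferedWriter instead of getNodeValue() should keep the tags.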
The original XML file is:
<collection sentiment="negativo">
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng3</text>
<lang>english</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>spa2</text>
<lang>spanish</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng1</text>
<lang>english</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>eng2</text>
<lang>english</lang>
</comment>
<comment>
<sentiment> ...</sentiment>
<chars>...</chars>
<words>...</words>
<text>spa1</text>
<lang>spanish</lang>
</comment>
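Since you want both eng.xml and spa.xml, an alternative to deleting nodes is to copy the matching comments into a new document, which leaves the parsed original untouched so it can be filtered once per language. This is only a sketch under my own assumptions: the class SplitByLang and the helpers parse, filterByLang and toXml are names I invented, and it assumes each comment has at most one relevant lang child.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SplitByLang {
    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Hypothetical helper: returns a new document holding a copy of the root
    // element (with its attributes) and every <comment> whose first <lang> matches.
    static Document filterByLang(Document source, String wanted) throws Exception {
        Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        // Shallow import: copies the root and its attributes, not its children.
        Element root = (Element) target.importNode(source.getDocumentElement(), false);
        target.appendChild(root);
        NodeList comments = source.getElementsByTagName("comment");
        for (int i = 0; i < comments.getLength(); i++) {
            Element comment = (Element) comments.item(i);
            NodeList langs = comment.getElementsByTagName("lang");
            if (langs.getLength() > 0
                    && wanted.equalsIgnoreCase(langs.item(0).getTextContent().trim())) {
                // Deep import: copies the comment together with all of its children.
                root.appendChild(target.importNode(comment, true));
            }
        }
        return target;
    }

    // Hypothetical helper: serializes a whole document to a String.
    static String toXml(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document source = parse("<collection sentiment=\"negativo\">"
                + "<comment><text>eng1</text><lang>english</lang></comment>"
                + "<comment><text>spa1</text><lang>spanish</lang></comment>"
                + "</collection>");
        System.out.println(toXml(filterByLang(source, "english")));
        System.out.println(toXml(filterByLang(source, "spanish")));
    }
}
```

To get the two files, you could pass a StreamResult wrapping a FileOutputStream for eng.xml and another for spa.xml instead of the StringWriter used here.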
Hope this helps! Happy hacking ;-)