I have the following XML:
<TEI>
<xi:include href="header.xml"/>
<text>
<body>
<!-- morph_1-p is akapit 7300 with instances (akapit_transzy-s) 14598, 14618 in batches (transza-s) 1461, 1463 resp. -->
<p corresp="ann_segmentation.xml#segm_1-p" xml:id="morph_1-p">
<s corresp="ann_segmentation.xml#segm_1.35-s" xml:id="morph_1.35-s">
<seg corresp="ann_segmentation.xml#segm_1.1-seg" xml:id="morph_1.1-seg">
<fs type="morph">
<f name="orth">
<string>Sami</string>
</f>
<!-- Sami [0,4] -->
<f name="interps">
<fs type="lex" xml:id="morph_1.1.1-lex">
<f name="base">
<string>sam</string>
</f>
<f name="ctag">
<symbol value="adj"/>
</f>
<f name="msd">
<vAlt>
<symbol value="pl:nom:m1:pos" xml:id="morph_1.1.1.1-msd"/>
<symbol value="pl:voc:m1:pos" xml:id="morph_1.1.1.2-msd"/>
</vAlt>
</f>
</fs>
</f>
<f name="disamb">
<fs feats="#an8003" type="tool_report">
<f fVal="#morph_1.1.1.1-msd" name="choice"/>
<f name="interpretation">
<string>sam:adj:pl:nom:m1:pos</string>
</f>
</fs>
</f>
</fs>
</seg>
In this XML only the seg node is repeated (all of its parent nodes occur exactly once).
I am trying to get:
<f name="orth">
<string>Sami</string>
</f>
and:
<f name="interpretation">
<string>sam:adj:pl:nom:m1:pos</string>
</f>
for every occurrence in the whole XML, without missing any.
Here is my code:
InputStream inputStream = new FileInputStream(file);
Reader inputStreamReader = new InputStreamReader(inputStream, "UTF-8");
InputSource inputSource = new InputSource(inputStreamReader);
inputSource.setEncoding("UTF-8");
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document document = documentBuilder.parse(inputSource);
document.getDocumentElement().normalize();
NodeList nodeListSeg = document.getElementsByTagName("seg");
for (int i = 0; i < nodeListSeg.getLength(); i++) {
    if (nodeListSeg.item(i).getFirstChild().getFirstChild().getNodeType() == Node.ELEMENT_NODE)
        words.add(((Element) nodeListSeg.item(i).getFirstChild().getFirstChild()).getTextContent().trim());
    if (nodeListSeg.item(i).getLastChild().getNodeType() == Node.ELEMENT_NODE)
        words.add(((Element) nodeListSeg.item(i).getLastChild()).getTextContent().trim());
}
inputStreamReader.close();
inputStream.close();
Another approach I tried was checking the attribute value:
if(((Element) nodeListSeg.item(i).getFirstChild()).getAttribute("name").equals("orth")) {...}
if(((Element) nodeListSeg.item(i).getFirstChild()).getAttribute("name").equals("interpretation"))
But this comparison never returns true.
Answer 0 (score: 0)
It turned out to work like this:
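NodeList nodeListSeg = document.getElementsByTagName("seg");
for (int i = 0; i < nodeListSeg.getLength(); i++) {
    NodeList nodeListChildren = nodeListSeg.item(i).getChildNodes();
    for (int j = 0; j < nodeListChildren.getLength(); j++) {
        // skip whitespace text nodes and comments between elements
        if (nodeListChildren.item(j).getNodeType() == Node.ELEMENT_NODE) {
            // getTextContent() concatenates the text of all descendants of this child
            String text = ((Element) nodeListChildren.item(j)).getTextContent().toLowerCase().trim();
            // split on any whitespace run, since the values may be separated by newlines and indentation
            String[] stringArray = text.split("\\s+");
            // first token is the orth value, last token is the interpretation
            System.out.println(stringArray[0] + "\t" + stringArray[stringArray.length - 1]);
        }
    }
}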
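So it turned out that the nodes nested inside each fs node were not being handled individually: getTextContent() concatenates the text of all descendants, so the whole fs node was treated as a single element.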
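Instead of splitting the concatenated text, the f elements can be picked out by their name attribute. The attribute check in the question most likely failed because, in the sample XML, the first element under seg is fs, which has no name attribute, so getAttribute("name") returns an empty string. Below is a minimal sketch of that idea, not a definitive implementation. The class name SegFeatureReader, the helpers featureText and readPairs, and the shortened inline XML are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code.
class SegFeatureReader {

    // Returns the trimmed text of the first <f> descendant of seg whose
    // name attribute equals featureName, or null if there is none.
    static String featureText(Element seg, String featureName) {
        // getElementsByTagName searches all descendants, not only direct children
        NodeList features = seg.getElementsByTagName("f");
        for (int i = 0; i < features.getLength(); i++) {
            Element f = (Element) features.item(i);
            if (featureName.equals(f.getAttribute("name"))) {
                return f.getTextContent().trim();
            }
        }
        return null;
    }

    // Collects one {orth, interpretation} pair per <seg> element.
    static List<String[]> readPairs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String[]> pairs = new ArrayList<>();
        NodeList segs = doc.getElementsByTagName("seg");
        for (int i = 0; i < segs.getLength(); i++) {
            Element seg = (Element) segs.item(i);
            pairs.add(new String[] {
                featureText(seg, "orth"),
                featureText(seg, "interpretation")
            });
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample modelled on the XML in the question (an assumption).
        String xml = "<body><seg><fs type=\"morph\">"
                + "<f name=\"orth\"><string>Sami</string></f>"
                + "<f name=\"disamb\"><fs type=\"tool_report\">"
                + "<f name=\"interpretation\"><string>sam:adj:pl:nom:m1:pos</string></f>"
                + "</fs></f></fs></seg></body>";
        for (String[] pair : readPairs(xml)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

This keeps the original case of the word and does not depend on how the file is indented, because each value is read from its own f element.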
Answer 1 (score: 0)
As usual, XPath is the better choice here:
Document doc = DocumentBuilderFactory
.newInstance()
.newDocumentBuilder()
.parse(new File(...));
XPath xp = XPathFactory
.newInstance()
.newXPath();
String s1 = (String) xp.evaluate("//f[@name='orth']/string/text()", doc, XPathConstants.STRING);
System.out.println(s1);
String s2 = (String) xp.evaluate("//f[@name='interpretation']/string/text()", doc, XPathConstants.STRING);
System.out.println(s2);
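Note that with XPathConstants.STRING, evaluate returns the string value of the first matching node only. Since the question asks for every occurrence, here is a hedged sketch that requests XPathConstants.NODESET and iterates over the result. The class name XPathAllMatches, the helpers parse and selectAll, and the inline sample XML are assumptions made for illustration. It also assumes the elements are in no namespace, as in the excerpt shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class XPathAllMatches {

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates the expression as a node set and returns the trimmed
    // text content of every matching node, in document order.
    static List<String> selectAll(Document doc, String expression) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Shortened two-segment sample; the second segment is invented for illustration.
        String xml = "<body>"
                + "<seg><f name=\"orth\"><string>Sami</string></f>"
                + "<f name=\"interpretation\"><string>sam:adj:pl:nom:m1:pos</string></f></seg>"
                + "<seg><f name=\"orth\"><string>i</string></f>"
                + "<f name=\"interpretation\"><string>i:conj</string></f></seg>"
                + "</body>";
        Document doc = parse(xml);
        System.out.println(selectAll(doc, "//f[@name='orth']/string"));
        System.out.println(selectAll(doc, "//f[@name='interpretation']/string"));
    }
}
```

The two lists are built independently, so if some seg lacks one of the two features the positions will no longer line up. In that case it may be safer to select the seg nodes first and evaluate relative expressions against each one.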