VTD-XML element fragment is incorrect

Time: 2018-07-24 20:54:50

Tags: vtd-xml

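When I use VTD-XML to parse an XML document (in UTF-8) that contains special characters such as ©, I run into a problem: the element fragment returned by getElementFragment is incorrect.

Sample code: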

VTDGen vg = new VTDGen();
String xmlDocument =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n" + 
        "<Root>\r\n" + 
        "  <!-- © -->\r\n" + 
        "  <SomeElement/>\r\n" + 
        "</Root>";
// For some reason with US_ASCII it does work, although the file is UTF-8.
vg.setDoc(xmlDocument.getBytes(StandardCharsets.UTF_8));
// True or false doesn't matter here, same result.
vg.parse(false);
// Find the element and its fragment.
VTDNav nv = vg.getNav();
AutoPilot ap = new AutoPilot(nv);
ap.selectXPath("//SomeElement");
while ((ap.evalXPath()) != -1) {
    long elementOffset = nv.getElementFragment();
    int contentStartIndex = (int)elementOffset;
    int contentEndIndex = contentStartIndex + (int)(elementOffset>>32);
    System.out.println("Returned fragment: " + contentStartIndex + ":" + contentEndIndex + ":\n'" + xmlDocument.substring(contentStartIndex, contentEndIndex) + "'");
}

This returns:

Returned fragment: 65:79:
'SomeElement/>
'

When StandardCharsets.UTF_8 is changed to StandardCharsets.US_ASCII, it does work:

Returned fragment: 64:78:
'<SomeElement/>'

This leads to incorrect behavior when the input file is a UTF-8 file. Could this be a bug in VTD-XML, or am I doing something wrong here?

1 answer:

Answer 0 (score: 0)

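"©" is a Unicode character that takes two bytes in UTF-8, so after it every start/end character offset in the String is one less than the corresponding start/end byte offset. The offsets returned by getElementFragment are byte offsets into the parsed document, not String indices. This is not a bug; here is a workaround: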

while (ap.evalXPath() != -1) {
    long elementOffset = nv.getElementFragment();
    // Lower 32 bits: start offset of the fragment; upper 32 bits: its length.
    int contentStartIndex = (int) elementOffset;
    int contentLength = (int) (elementOffset >> 32);
    int contentEndIndex = contentStartIndex + contentLength;
    // Let VTDNav decode the range instead of calling
    // xmlDocument.substring(contentStartIndex, contentEndIndex), which expects character indices.
    System.out.println("Returned fragment: " + contentStartIndex + ":" + contentEndIndex + ":\n'"
            + nv.toString(contentStartIndex, contentLength) + "'");
}
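
If the String index is still needed, the byte offsets can also be mapped back using only the JDK. The following is a minimal sketch, not part of VTD-XML: the class name FragmentOffsets and its two helper methods are hypothetical, and the values 65 and 14 are the offset and length taken from the output in the question rather than from a live VTDNav. It assumes the document is UTF-8 and that the offset falls on a character boundary, which should hold for the start of an element.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of VTD-XML.
class FragmentOffsets {
    // The same document as in the question.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n" +
            "<Root>\r\n" +
            "  <!-- \u00A9 -->\r\n" +
            "  <SomeElement/>\r\n" +
            "</Root>";

    // Decode a byte range of the UTF-8 document directly, so no
    // byte-to-character index conversion is needed.
    static String sliceBytes(byte[] doc, int byteOffset, int byteLength) {
        return new String(doc, byteOffset, byteLength, StandardCharsets.UTF_8);
    }

    // Map a UTF-8 byte offset to a Java String index by decoding the prefix
    // and counting its UTF-16 code units. Assumes byteOffset is on a
    // character boundary.
    static int byteOffsetToCharIndex(byte[] doc, int byteOffset) {
        return new String(doc, 0, byteOffset, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = XML.getBytes(StandardCharsets.US_ASCII);

        // The copyright sign is two bytes in UTF-8; US_ASCII replaces it
        // with a single '?' byte, which is why offsets happened to line up.
        System.out.println("chars=" + XML.length()
                + " utf8Bytes=" + utf8.length
                + " asciiBytes=" + ascii.length);

        int byteOffset = 65;  // assumed: offset reported in the question
        int byteLength = 14;  // assumed: fragment length from the question

        System.out.println("'" + sliceBytes(utf8, byteOffset, byteLength) + "'");

        int charStart = byteOffsetToCharIndex(utf8, byteOffset);
        int charEnd = byteOffsetToCharIndex(utf8, byteOffset + byteLength);
        System.out.println(charStart + ":" + charEnd + " '"
                + XML.substring(charStart, charEnd) + "'");
    }
}
```

Slicing the byte array is the simpler of the two options, since it never touches String indices. The prefix-decoding conversion costs time proportional to the offset on every call, so it is only reasonable for small documents or occasional lookups.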