Question

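My problem is that my XML parser (VTD-XML) does not seem to handle Unicode supplementary characters (please correct me if I am already wrong here). It looks as if the parser only uses 16 bits per character.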

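I cannot switch to another parser in the project I am working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters were added last year (for example https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, at the end of the results section).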

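As a quick and dirty fix, I simply delete all characters above 0xFFFF from the documents. Obviously this destroys some expressions in the documents, so I am not happy with this solution.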
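For reference, the stripping step described above can be sketched roughly as follows. This is only an illustration of the workaround, not the project's actual code; the class and method names are made up for the example.

```java
final class SupplementaryStripper {

    /** Returns a copy of the input without any code point above U+FFFF. */
    static String stripSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !Character.isSupplementaryCodePoint(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 is written as the surrogate pair \uD801\uDC00 in Java source
        String text = "a\uD801\uDC00b";
        System.out.println(stripSupplementary(text)); // prints "ab"
    }
}
```

Iterating over code points rather than `char` values removes both halves of a surrogate pair together.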

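Since I cannot change the parser, I am wondering whether it is possible to map supplementary characters to characters in the BMP that look similar, where such characters exist.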
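One way such a mapping might look is Unicode compatibility normalization (NFKC), which folds some supplementary characters, for example the Mathematical Alphanumeric Symbols block, into plain BMP letters. Whether the characters in the Medline abstracts belong to that group is an assumption I have not verified. The sketch below uses the standard `java.text.Normalizer`; the class name, the method name and the U+FFFD fallback are choices made for this example only.

```java
import java.text.Normalizer;

final class BmpMapper {

    // Hypothetical fallback for characters that have no BMP-only NFKC form
    static final char FALLBACK = '\uFFFD';

    /** Replaces each supplementary code point by its NFKC form if that form fits into the BMP. */
    static String mapToBmp(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
                return;
            }
            String folded = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFKC);
            boolean fitsBmp = folded.codePoints()
                    .noneMatch(Character::isSupplementaryCodePoint);
            out.append(fitsBmp ? folded : String.valueOf(FALLBACK));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D465 MATHEMATICAL ITALIC SMALL X has the compatibility mapping "x"
        System.out.println(mapToBmp("\uD835\uDC65 = 2")); // prints "x = 2"
    }
}
```

Characters without a compatibility mapping, such as U+10400 from the test case in this question, end up as the fallback character, so this would only cover part of the problem.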

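I am of course open to any other ideas. Replacing the supplementary characters with some kind of placeholder and putting the original characters back in afterwards would work, but it seems error-prone. Any better ideas?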
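To make the placeholder idea concrete, here is a minimal sketch that swaps each supplementary character for a Private Use Area character (U+E000 to U+F8FF) before parsing and restores it afterwards. It assumes the documents never use that range themselves, and all names are hypothetical. It also shows part of why the approach feels fragile: the mapping has to be kept alongside the text, and a placeholder takes 3 bytes in UTF-8 where the original took 4, so byte offsets into the original document no longer line up.

```java
import java.util.HashMap;
import java.util.Map;

final class PlaceholderCodec {

    private static final int PUA_START = 0xE000;
    private static final int PUA_END = 0xF8FF;

    private final Map<Integer, Character> toPlaceholder = new HashMap<>();
    private final Map<Character, Integer> toOriginal = new HashMap<>();

    /** Replaces every supplementary code point by a BMP placeholder. */
    String encode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp >= PUA_START && cp <= PUA_END) {
                throw new IllegalArgumentException("input already uses the private use area");
            }
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(placeholderFor(cp));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    /** Puts the original characters back into text obtained from the parser. */
    String decode(String parsed) {
        StringBuilder out = new StringBuilder(parsed.length());
        for (int i = 0; i < parsed.length(); i++) {
            char c = parsed.charAt(i);
            Integer original = toOriginal.get(c);
            if (original != null) {
                out.appendCodePoint(original);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Hands out one placeholder per distinct supplementary code point
    private char placeholderFor(int codePoint) {
        Character existing = toPlaceholder.get(codePoint);
        if (existing != null) {
            return existing;
        }
        int next = PUA_START + toPlaceholder.size();
        if (next > PUA_END) {
            throw new IllegalStateException("ran out of placeholder characters");
        }
        char placeholder = (char) next;
        toPlaceholder.put(codePoint, placeholder);
        toOriginal.put(placeholder, codePoint);
        return placeholder;
    }
}
```

The same `PlaceholderCodec` instance has to be used for `encode` and `decode`, since it holds the mapping between placeholders and original characters.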

Edit: Here is a (hopefully) minimal example of how VTD-XML behaves in this case:

@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
    // character code point 0x10400
    String unicode = "<supplementary>\uD801\uDC00</supplementary>";
    // encode explicitly as UTF-8 so the test does not depend on the platform default charset
    byte[] unicodeBytes = unicode.getBytes("UTF-8");
    assertEquals(unicode, new String(unicodeBytes, "UTF-8"));

    VTDGen vg = new VTDGen();
    vg.setDoc(unicodeBytes);
    vg.parse(false);
    VTDNav vn = vg.getNav();
    // lower 32 bits hold the offset, upper 32 bits the length of the element content
    long fragment = vn.getContentFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
    String vtdString = vn.toRawString(offset, length);
    // this actually succeeds
    assertEquals("\uD801\uDC00", originalBytePortion);
    // this fails ;-( the returned character is Ѐ, codepoint 0x400, thus the high surrogate is missing
    assertEquals("\uD801\uDC00", vtdString);
}
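The 0x400 in the failing assertion is at least consistent with my suspicion that the code point gets cut down to 16 bits: 0x10400 with everything above the lower 16 bits dropped is 0x0400. The following standalone snippet only illustrates that arithmetic with standard `java.lang.Character` methods; it makes no claim about what VTD-XML does internally, and the class and method names are made up for the example.

```java
final class SurrogateMath {

    /** Keeps only the lower 16 bits of a code point, as a 16-bit character type would. */
    static int truncateTo16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        // Combine the surrogate pair from the test into one code point
        int codePoint = Character.toCodePoint('\uD801', '\uDC00');
        System.out.printf("U+%X -> U+%04X%n", codePoint, truncateTo16Bits(codePoint));
        // prints "U+10400 -> U+0400"
    }
}
```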

Map supplementary Unicode characters to the BMP (if possible)

0 answers: