Question

I have a problem with my output encoding. Here is one case:

"<" + this.strName + ">" + strData + "</" + this.strName + ">"
return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(returnFullTagData(strData).getBytes())).getDocumentElement();

Debugging in NetBeans works fine, but when I run the built version it throws "Invalid byte 2 of 3-byte UTF-8 sequence".

I worked around it like this:

new String( ("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes(), "UTF-8");

But I need to change it so that it works like the first option. Why? Because of this:

When I try to save the new XML file, it is saved correctly under NetBeans debug:

<kind schema="">Fonología</kind>

But the built version has the same encoding problem:

<kind schema="">Fonolog?a</kind>

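I think the two problems are directly related, but I don't know how.

Of course, I tried to fix this by changing the encoding of the input data in my XML, as in the first case, but it did not work.

Edit

OK, I am now using some of your answers and I found something very interesting.

The first case, changed to: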

strData = "<" + this.strName + ">" + strData2 + "</" + this.strName + ">";
return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(returnFullTagData(strData))))
                .getDocumentElement();

It works great, no more ??? (and UnsupportedEncodingException is no longer needed, which I love).

The second change is the way the base XML file is read:

DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

        FileInputStream in = new FileInputStream(new File(strBase));
        doc = dBuilder.parse(in, "UTF-8");

But now I have another problem:

<li>ArtÃculo Definido</li>

instead of

<li>Artículo Definido</li>

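This is a bit tricky, because I use two types of nodes in this document: the "string-based" nodes print correctly, but the "file-based" nodes have this problem.

The libraries I use are POI, Guava, the XMLBeans that ships with POI, and dom4j.

PS: Once again, this only happens in the built version. Why does it happen? I'm really tired, and trying to debug it is basically useless.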

Answer 1

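Seeing í replaced by ? means that text was converted from Unicode (Java text, a String) to bytes using an encoding that cannot map that letter.

Use String.getBytes(StandardCharsets.UTF_8). (Unless the <?xml ...> declaration names an encoding other than UTF-8.)

Avoid s = new String(s.getBytes(), "UTF-8"); it is a hack, a workaround, and it still has pitfalls.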
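To make the ? replacement concrete, here is a minimal sketch. The class and method names are made up for illustration, and US-ASCII merely stands in for whatever platform default the built version happens to run with. The non-ASCII letter is written as a \u escape so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // Encodes with a charset that has no mapping for 'í'. String.getBytes
    // substitutes the charset's replacement byte, which is '?' for US-ASCII.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Encodes and decodes with UTF-8, which can represent every character.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Fonolog\u00EDa"; // "Fonología"
        // Booleans are printed so the result does not depend on the console encoding.
        System.out.println(roundTripAscii(text).equals("Fonolog?a"));
        System.out.println(roundTripUtf8(text).equals(text));
    }
}
```

Once the letter has become ?, the information is gone; no later re-decoding as UTF-8 can bring it back.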

For good measure:

NetBeans IDE, Project Properties / Encoding: UTF-8
Maven pom.xml: <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

After looking at the shortened project

I found nothing suspicious. Please try:

public static void printDocument(Document doc, OutputStream out) throws IOException, TransformerException {
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    //transformer.setOutputProperty("omit-xml-declaration", "no");
    transformer.setOutputProperty("method", "xml");
    transformer.setOutputProperty("indent", "yes");
    //transformer.setOutputProperty("encoding", "UTF-8");
    //transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

    //transformer.transform(new DOMSource(doc), new StreamResult(new OutputStreamWriter(out, "UTF-8")));
    transformer.transform(new DOMSource(doc), new StreamResult(out));
}
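As a way to check a method like this without involving the console or a file, the following sketch serializes a small document to a byte array and parses the bytes back. The class name, helper names and sample element are assumptions for illustration; the point is only that the transformer does the character-to-byte encoding itself when it is handed an OutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlWriteDemo {
    // Serializes the document to bytes; no String-to-byte conversion happens in our code.
    static byte[] toUtf8Bytes(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Builds <kind schema="">Fonología</kind> in memory (sample data only).
    static Document sampleDocument() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element kind = doc.createElement("kind");
        kind.setAttribute("schema", "");
        kind.setTextContent("Fonolog\u00EDa");
        doc.appendChild(kind);
        return doc;
    }

    // Writes the sample document and reads the bytes back with a fresh parser.
    static String roundTrip() throws Exception {
        byte[] bytes = toUtf8Bytes(sampleDocument());
        Document parsed = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return parsed.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip().equals("Fonolog\u00EDa"));
    }
}
```

If a check like this passes in the built version but the saved file is still wrong, the damage is probably happening before the document is built or after the bytes are written.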

Answer 2

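When you call getBytes() on a String, you get the bytes in the platform's default encoding. When you use the String(byte[]) constructor, the bytes are converted back to a String using the platform's default encoding.

By combining the two, as in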

return new String(("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes());

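you are performing an obsolete String-to-bytes-and-back-to-String conversion in the best case, i.e. if the platform's default encoding can handle all characters, and you are destroying information if it cannot. So don't be surprised to see ? in place of those characters.

At this place there is a simple solution: just remove the obsolete conversion: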

return "<" + this.strName + ">" + strData + "</" + this.strName + ">";

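Of course, now that these characters are no longer destroyed, they may cause problems at other places where the platform default encoding is used while UTF-8 is expected. You can search for all conversions between String and byte[] and make sure they all use the same encoding, preferably UTF-8, but you can also decide to remove these unnecessary conversions altogether.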
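To illustrate what a mismatch between the two directions looks like, here is a small sketch (class and method names are hypothetical, and ISO-8859-1 is only an assumed stand-in for a non-UTF-8 default). Text encoded as UTF-8 but decoded with a single-byte charset turns every non-ASCII letter into two characters, which resembles output such as "ArtÃculo".

```java
import java.nio.charset.StandardCharsets;

class MismatchedCharsetDemo {
    // Encodes as UTF-8 but decodes as ISO-8859-1: each byte of a multi-byte
    // UTF-8 sequence becomes a separate character.
    static String decodeUtf8AsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Same encoding in both directions: the text survives unchanged.
    static String decodeUtf8AsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Art\u00EDculo"; // "Artículo"
        // 'í' is the byte pair C3 AD in UTF-8, which ISO-8859-1 reads as
        // U+00C3 ('Ã') followed by U+00AD (an invisible soft hyphen).
        System.out.println(decodeUtf8AsLatin1(text).equals("Art\u00C3\u00ADculo"));
        System.out.println(decodeUtf8AsUtf8(text).equals(text));
    }
}
```

Unlike the ? case, this kind of damage is reversible as long as the wrongly decoded characters are still intact, but it is far better to fix the mismatched conversion than to undo it afterwards.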

If the source is a String, i.e. already characters, process it as such:

return DocumentBuilderFactory.newInstance().newDocumentBuilder()
    .parse(new InputSource(new StringReader(returnFullTagData(strData))))
    .getDocumentElement();

No conversion, no data loss...
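When the source really is bytes, such as a file, a comparable sketch is to hand the stream to the parser through an InputSource and state the encoding there. The class and method names below are illustrative, and the sketch assumes the bytes are in fact UTF-8. One thing worth knowing: in DocumentBuilder.parse(InputStream, String) the second argument is the system ID used to resolve relative URIs, not a character encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ByteSourceParseDemo {
    // Lets the parser decode the bytes itself; the encoding is declared on
    // the InputSource (assumption: the stream really contains UTF-8).
    static Element parseUtf8(InputStream in) throws Exception {
        InputSource source = new InputSource(in);
        source.setEncoding("UTF-8");
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(source)
                .getDocumentElement();
    }

    // Stands in for a file: UTF-8 bytes held in memory.
    static String demo() throws Exception {
        byte[] bytes = "<li>Art\u00EDculo Definido</li>".getBytes(StandardCharsets.UTF_8);
        return parseUtf8(new ByteArrayInputStream(bytes)).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo().equals("Art\u00EDculo Definido"));
    }
}
```

For a real file, a FileInputStream opened in a try-with-resources block would take the place of the ByteArrayInputStream.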

Answer 3

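OK, thank you for your help. It was very useful for fixing some issues, though not the main one, and any improvement is really appreciated. The problem was the Guava library, but I don't know why. I just went back to my first version and removed the library; the Release project started working correctly, just like Debug mode. If anyone can explain why this happens, I would be even more grateful.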

UTF-8 encoding error in the built project (the debug project works fine)
