Reading and parsing Chinese characters from XML

Time: 2018-05-09 09:52:53

Tags: java character-encoding

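When I read Chinese characters from an XML file, I get some illegal or wrongly encoded characters. I cannot parse the XML file correctly with DOM / SAX. I tried specifying the "UTF-8" encoding, but I still do not get the correct output. Sometimes I get question marks (?) instead of Chinese characters.

My requirement is this: I have an XML file containing Chinese characters. I need to read and parse the Chinese characters from the file and then write them to another file. Please help me solve this. Here is my code.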

TestMain.java

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class TestMain {
    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("C:\\temp\\myInputFile.txt")));
        StringBuilder out = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            out.append(line);
        }
        reader.close();
        System.out.println(out.toString());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(out.toString().getBytes("UTF-8")));
        DOMSource domSource = new DOMSource(doc);
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        JAXBContext context = JAXBContext.newInstance(Sender.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Sender sender = (Sender) unmarshaller.unmarshal(new ByteArrayInputStream(writer.toString().getBytes("UTF-8")));
        System.out.println(sender.toString());
        FileOutputStream fos = new FileOutputStream("C:\\temp\\myOutputFile.txt");
        fos.write(sender.toString().getBytes());
        fos.flush();
        fos.close();
    }
}

Sender.java

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "", propOrder = { "name" })
@XmlRootElement(name = "sender")
public class Sender {
    @XmlElement(required = true)
    protected String name;
    public String getName() {
        return name;
    }
    public void setName(String value) {
        this.name = value;
    }
    @Override
    public String toString() {
        // TODO Auto-generated method stub
        return "<sender><name>"+this.name+"</name></sender>";
    }
}

myInputFile.txt

<sender><name>奥迪普时装(深圳)有限公司</name></sender>

myOutputFile.txt

<sender><name>奥迪普时装(深圳)有陿公忸</name></sender>

Comparing the output file with the input file, the first and third characters from the right are different.

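1 answer:

Answer 0: (score: 0)

I have found the solution.

When reading the file, we need to specify the UTF-8 charset on the input stream, and when writing, we need to set UTF-8 encoding on the output stream by using a PrintStream.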
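To see why the charset has to be stated on both sides, here is a minimal sketch of what happens when UTF-8 bytes are decoded and re-encoded with a charset that cannot represent them. The class and method names are hypothetical, and US-ASCII is only a stand-in for whatever the platform default charset was on the machine in the question (an assumption; the question does not say which one it was).

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decode UTF-8 bytes with the wrong charset, then encode them again
    // with that same wrong charset (similar to relying on a non-UTF-8 default).
    static String roundTripThroughAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        // Every byte >= 0x80 is malformed in US-ASCII and becomes U+FFFD.
        String misdecoded = new String(utf8Bytes, StandardCharsets.US_ASCII);
        // U+FFFD cannot be encoded in US-ASCII and is replaced by '?'.
        byte[] reencoded = misdecoded.getBytes(StandardCharsets.US_ASCII);
        return new String(reencoded, StandardCharsets.US_ASCII);
    }

    // The same round trip with UTF-8 named explicitly on both sides.
    static String roundTripThroughUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ch = "\u9650"; // the third character from the right in the sample name
        System.out.println(roundTripThroughAscii(ch));           // prints "???"
        System.out.println(roundTripThroughUtf8(ch).equals(ch)); // prints "true"
    }
}
```

The single character occupies three bytes in UTF-8, so the mismatched round trip yields three question marks, which matches the symptom described in the question.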

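// While reading the file: decode the bytes as UTF-8 instead of the platform default
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream("C:\\temp\\myInputFile.txt"), "UTF-8"));

// While writing the file: wrap the existing FileOutputStream (fos) in a UTF-8 PrintStream
// (requires import java.io.PrintStream)
PrintStream ps = new PrintStream(fos, true, "UTF-8");
ps.print(sender.toString());
ps.close();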
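The two changes above can be put together as a small self-contained sketch. It is not the original program: the DOM and JAXB steps are left out, the class and helper names (`Utf8FileCopy`, `readUtf8`, `writeUtf8`) are hypothetical, and temporary files stand in for the `C:\temp` paths so that it runs anywhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8FileCopy {

    // Read the whole file as UTF-8, joining lines as the question's loop does.
    static String readUtf8(Path file) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line);
            }
        }
        return out.toString();
    }

    // Write the text as UTF-8 through a PrintStream, as in the answer.
    static void writeUtf8(Path file, String text) throws IOException {
        try (PrintStream ps = new PrintStream(Files.newOutputStream(file), true, "UTF-8")) {
            ps.print(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("myInputFile", ".txt");
        Path out = Files.createTempFile("myOutputFile", ".txt");

        // The last four characters of the sample name, written as escapes
        // so the source file's own encoding does not matter.
        String xml = "<sender><name>\u6709\u9650\u516c\u53f8</name></sender>";
        Files.write(in, xml.getBytes(StandardCharsets.UTF_8));

        writeUtf8(out, readUtf8(in));

        // The output bytes should be identical to the input bytes.
        System.out.println(Arrays.equals(Files.readAllBytes(in), Files.readAllBytes(out)));
    }
}
```

In the full program, the `getBytes()` call without a charset on the output side would also need an explicit charset, or be replaced by the `PrintStream` as shown in the answer.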