Question

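I am using Apache POI to read a .docx file and, after some processing, write the data to a .csv file. The .docx file I am working with is in French, but when I write the data to the .csv, some French characters are converted into special characters. For example, Être un membre clé becomes ÃŠtre un membre clÃ©.

The code below is used to write the file: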

        Path path = Paths.get(filePath);
        BufferedWriter bw = Files.newBufferedWriter(path);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);

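This uses UTF-8 by default.

While debugging, I checked the data just before writing it to the .csv, and it looked correct. So is it being converted during the write? I have set the default locale to Locale.FRENCH.

Am I missing something?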

Answer 1

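I suspect it is Excel that reads the UTF-8 encoded CSV as ANSI. This happens when you simply open the CSV in Excel without using the Text Import Wizard. In that case Excel always assumes ANSI unless there is a BOM (byte order mark) at the start of the file. If you open the CSV in a text editor that supports Unicode, everything is correct.

Example: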
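To illustrate the suspected cause, here is a minimal sketch that encodes the sample text as UTF-8 and then decodes the same bytes as windows-1252. It assumes that "ANSI" means windows-1252, which is typical for Western European Windows installations but may differ on other systems. The class name `MojibakeDemo` and the method `misdecode` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes as windows-1252.
    // Assumption: the reader's "ANSI" code page is windows-1252.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u00CA is Ê and \u00E9 is é; the escapes keep this source file
        // independent of the compiler's source encoding.
        String original = "\u00CAtre un membre cl\u00E9";
        System.out.println(misdecode(original));
    }
}
```

Under that assumption, `misdecode` returns the same garbled string shown in the question (ÃŠtre un membre clÃ©), which suggests the file itself is fine and only the program reading it picks the wrong encoding.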

import java.io.BufferedWriter;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;

import java.util.Locale;
import java.util.List;
import java.util.ArrayList;

import com.opencsv.CSVWriter;

class DocxToCSV {

 public static void main(String[] args) throws Exception {

  Locale.setDefault(Locale.FRENCH);

  List<String[]> data = new ArrayList<String[]>();
  data.add(new String[]{"F1", "F2", "F3", "F4"});
  data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
  data.add(new String[]{"Être", "un", "membre", "clé"});

  Path path = Paths.get("test.csv");
  BufferedWriter bw = Files.newBufferedWriter(path);

  //bw.write(0xFEFF); bw.flush(); // write a BOM to the file

  CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n");
  writer.writeAll(data);
  writer.flush();
  writer.close();

 }
}

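Now, if you open test.csv in a text editor that supports Unicode, everything is correct. But if you open the same file directly in Excel, the accented characters are garbled in the way described in the question.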

Now we do the same thing, but with the line

bw.write(0xFEFF); bw.flush(); // write a BOM to the file

active (uncommented).

With the BOM in place, simply opening test.csv in Excel displays the accented characters correctly.

Of course, the better approach is always to use Excel's Text Import Wizard, where the file encoding can be selected explicitly.
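The BOM approach can also be checked without opencsv. The following is a minimal sketch using only the JDK: it writes U+FEFF first, then a few CSV lines, with UTF-8 passed explicitly instead of relying on the default. The class name `BomCsvSketch`, the method `writeWithBom`, and the file name `bom-test.csv` are hypothetical and chosen only for this illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BomCsvSketch {

    // Writes the given lines as a UTF-8 file that starts with a BOM.
    static void writeWithBom(Path path, String... lines) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            bw.write('\uFEFF'); // U+FEFF is encoded as the bytes EF BB BF in UTF-8
            for (String line : lines) {
                bw.write(line);
                bw.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("bom-test.csv"); // hypothetical file name
        writeWithBom(path, "F1;F2", "\u00CAtre;cl\u00E9");

        // Show the first three bytes of the file in hex to inspect the BOM.
        byte[] bytes = Files.readAllBytes(path);
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
    }
}
```

Using try-with-resources closes the writer even if a write fails, so the explicit flush() and close() calls are not needed in this variant.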

Answer 2

Check which character encoding is used when the resulting file is read.
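If the file is read back from Java, one way to rule out a decoding mismatch is to pass the charset explicitly. The sketch below reads the first line as UTF-8 and drops a leading BOM if there is one, since Java's UTF-8 decoder passes the BOM through as the character U+FEFF. The class name `ReadCsvSketch` and the method `readFirstLine` are hypothetical names for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadCsvSketch {

    // Reads the first line of the file as UTF-8 and strips a leading BOM.
    // Returns null if the file is empty.
    static String readFirstLine(Path path) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line = br.readLine();
            if (line != null && line.startsWith("\uFEFF")) {
                line = line.substring(1); // drop the BOM character
            }
            return line;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes test.csv from the example above exists in the working directory.
        System.out.println(readFirstLine(Paths.get("test.csv")));
    }
}
```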
