Question

I have this code that searches a document, saves its sentences into an ArrayList<StringBuffer>, and then saves that object to a file:

public static void save(String doc_path) {
    StringBuffer text  = new StringBuffer(new Corpus().createDocument(doc_path + ".txt").getDocStr());
    ArrayList<StringBuffer> lines = new ArrayList<>();
    Matcher matcher = compile("(?<=\n).*").matcher(text);
    while (matcher.find()) { 
        String line_str = matcher.group();
        if (checkSentenceLine(line_str)){
            lines.add(new StringBuffer(line_str));
        }          
    }
    FilePersistence.save (lines, doc_path + ".lin");  
    FilePersistence.save (lines.toString(), doc_path + "_extracoes.txt");
}

Corpus

public Document createDocument(String file_path) {
    File file = new File(file_path);
    if (file.isFile()) {
        return new Document(file);
    } else {
        Message.displayError("file path is not OK");
        return null;
    }
}

FilePersistence

public static void save (Object object_root, String file_path){
    if (object_root == null) return;
    // try-with-resources closes the stream even if writeObject throws
    try (ObjectOutputStream output = new ObjectOutputStream(new FileOutputStream (file_path))) {
        output.writeObject(object_root);
    } catch (Exception exception){
        System.out.println("Fail to save file: " + file_path + " --- " + exception);
    }
}

public static Object load (String file_path){
    // try-with-resources closes the stream once the object has been read
    try (ObjectInputStream input = new ObjectInputStream(new FileInputStream (file_path))) {
        return input.readObject();
    } catch (Exception exception){
        System.out.println("Fail to load file: " + file_path + " --- " + exception);
        return null;
    }
}
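
A side note on the code above: ObjectOutputStream writes Java's binary serialization format, so doc_path + "_extracoes.txt" is not a plain text file even though it ends in .txt, and an editor such as Notepad will show serialization bytes mixed in with the text. If the goal is a file that can be read in an editor, something like the following sketch might be closer. The class name TextFilePersistence and the method saveText are hypothetical, and UTF-8 is only an assumed choice of output encoding.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of the original FilePersistence class.
class TextFilePersistence {

    // Writes one sentence per line as plain text in an explicit encoding (UTF-8 assumed).
    static void saveText(List<? extends CharSequence> lines, String filePath) throws IOException {
        Path path = Paths.get(filePath);
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            for (CharSequence line : lines) {
                writer.write(line.toString());
                writer.newLine();
            }
        }
    }
}
```

It could be called as TextFilePersistence.saveText(lines, doc_path + "_extracoes.txt"), provided whatever opens the file afterwards is also told to read it as UTF-8.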

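The problem is that the document uses right single quotation mark characters as apostrophes. When I load it and print it to the screen in NetBeans, I get odd squares instead of apostrophes, and I get Â' if I open the file in Notepad. This prevents me from processing the extracted sentences correctly, or at least from displaying them correctly. At first I thought this was due to an encoding incompatibility.

So I tried changing the encoding in the project properties to CP1252, but that only changed the blank squares into question marks, and Notepad still shows the same Â'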
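
One possible reading of these symptoms, offered as an assumption because the code that reads the file is not shown: the .txt file may be stored in CP1252, where the right single quotation mark is the single byte 0x92, while the reader decodes it with a charset such as ISO-8859-1 that maps 0x92 to the control character U+0092. The sketch below only illustrates that mechanism; the class and method names are made up, and it assumes the JDK provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: reproduces the symptoms under the assumption of a CP1252-encoded input file.
class MojibakeDemo {

    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes the single byte 0x92 with the given charset.
    static String decodeQuoteByte(Charset charset) {
        return new String(new byte[] {(byte) 0x92}, charset);
    }

    // Encodes text as UTF-8, then misreads those bytes as CP1252, like an editor guessing wrong.
    static String utf8ReadAsCp1252(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String right = decodeQuoteByte(CP1252);                       // the real quotation mark
        String wrong = decodeQuoteByte(StandardCharsets.ISO_8859_1);  // an invisible control character
        System.out.printf("CP1252 decode:     U+%04X%n", (int) right.charAt(0));
        System.out.printf("ISO-8859-1 decode: U+%04X%n", (int) wrong.charAt(0));
        // The control character, saved as UTF-8 and viewed as CP1252, turns into two characters.
        for (char c : utf8ReadAsCp1252(wrong).toCharArray()) {
            System.out.printf("editor shows:      U+%04X%n", (int) c);
        }
    }
}
```

Running it prints U+2019 for the CP1252 decode, U+0092 for the ISO-8859-1 decode, and then U+00C2 followed by U+2019 for the editor view, that is, an Â followed by a quotation mark, much like what Notepad shows.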

I have also tried using

String line_str = matcher.group().replace("’","'");

and

String line_str = matcher.group().replace('\u2019','\'');

but neither has any effect.
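
A replace call that has no effect suggests the string may not contain U+2019 at all. One way to check is to print the code points of the non-ASCII characters in a line. The helper below is a hypothetical debugging aid (the names CodePointDump and nonAsciiCodePoints are invented for this sketch), not part of the original code.

```java
// Hypothetical debugging helper: lists the code point of every non-ASCII character in a line.
class CodePointDump {

    static String nonAsciiCodePoints(String line) {
        StringBuilder out = new StringBuilder();
        line.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // A line holding the real quotation mark versus one holding a C1 control character.
        System.out.println(nonAsciiCodePoints("It\u2019s here")); // prints "U+2019"
        System.out.println(nonAsciiCodePoints("It\u0092s here")); // prints "U+0092"
    }
}
```

Calling nonAsciiCodePoints(line_str) inside the loop would show which character the file reader actually produced, and therefore which character a replace would have to target.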

Update

if (checkSentenceLine(line_str)){
    System.out.println(line_str);
    lines.add(new StringBuffer(line_str));
}

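This runs before anything is saved to the binary file, and the single quotes are already messed up at this point: they show as blank squares in UTF-8 and as ? in CP1252. That makes me think the problem occurs when reading from the .txt file.

Strangely, if I do this: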

System.out.println('\u2019');

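it displays a perfectly correct right single quotation mark. The problem only occurs when the text is read from the .txt file, which makes me think the method I use to read the file is at fault. The same thing happens with bullet point characters.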
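
If the reading step is indeed where the characters go wrong, decoding the file with an explicitly named charset is one thing to try. The Document class is not shown, so the sketch below is only a stand-in for it: ExplicitCharsetReader and readAll are hypothetical names, and windows-1252 in the usage note is a guess at the file's real encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical reader: a stand-in for the unseen Document class.
class ExplicitCharsetReader {

    // Reads the whole file, decoding its bytes with the named charset instead of a default.
    static String readAll(String filePath, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

For example, String text = ExplicitCharsetReader.readAll(doc_path + ".txt", "windows-1252"); would decode the byte 0x92 as the right single quotation mark. If the file turns out to be UTF-8, passing "UTF-8" instead would be the matching choice.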

Could the problem be in converting the StringBuffer to a String? If so, how can I prevent that from happening?
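
For what it is worth, StringBuffer and String both hold UTF-16 chars in memory and no byte encoding is involved when converting between them, so that step should not alter any character. A quick check along these lines (the class name is invented for the sketch) can rule it out:

```java
// Small check: converting between String and StringBuffer keeps every char unchanged.
class StringBufferRoundTrip {

    static boolean survivesRoundTrip(String original) {
        StringBuffer buffer = new StringBuffer(original);
        return buffer.toString().equals(original);
    }

    public static void main(String[] args) {
        // Right single quotation mark and bullet, the two characters mentioned in the question.
        System.out.println(survivesRoundTrip("It\u2019s a \u2022 bullet")); // prints "true"
    }
}
```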

java - Writing and reading a file containing the right single quotation mark character

0 answers: