Normalizing text - reading each character and removing spaces - wrong encoding

Time: 2014-10-22 22:05:08

Tags: string file encoding character writing

I have a problem that I can't solve. Can someone help me?

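OK, so I want a program that normalizes my text: it should collapse multiple spaces, print the other characters from the original file, and add spaces as well as start and end symbols.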

After the conversion, once I have written the txt file and opened it, I see this content:

  

numaituaã§ãoceoemergãªnciamã©dica

As you can see, there are some weird characters that I don't want, maybe because of the encoding? The text is in my language, Portuguese.

Here is my code. How can I fix this?

public static void main(String[] args) throws IOException {

        Charset encoding = Charset.defaultCharset();

        InputStream in = new FileInputStream(new File("data.txt"));
        Reader reader = new InputStreamReader(in, encoding);
        Reader buffer = new BufferedReader(reader);
        StringBuilder normalizedLanguage = new StringBuilder("<");
        int r;
        while ((r = buffer.read()) != -1) {
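            char ch = (char) r;

            // Per-character state. Note that these are declared inside the loop,
            // so they are reset to their initial values for every character read.
            boolean newline = false;
            boolean hasLetterBefore = false;
            boolean hasLetterAfter = false;
            char symbol = '-';
            int lines = 0;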

            if (newline)
            {
                normalizedLanguage.append("\n<");
            }


            if (ch == '\r' || ch == '\n' )
            {
                lines++;
                normalizedLanguage.append(">");
                newline = true;
                hasLetterBefore = false;


            }
            else if (Character.isLetterOrDigit(ch))
            {
                if (hasLetterBefore == true)
                {
                    normalizedLanguage.append(Character.toString(symbol) + Character.toString(Character.toLowerCase(ch)));
                }else{
                    normalizedLanguage.append(Character.toString(Character.toLowerCase(ch)));
                }


                newline = false;
                hasLetterBefore = true;
            }
            else if (ch == ' ')
            {
                normalizedLanguage.append(Character.toString(ch));
                newline = false;
                hasLetterBefore = false;
            }
            else if (ch == '\t')
            {
                System.out.println("Tab detected: " + ch);
                newline = false;
                hasLetterBefore = false;
            }
            else
            {
                // Symbols, among other characters
                if (!hasLetterBefore)
                {
                    normalizedLanguage.append(" " + Character.toString(ch) + " ");
                }
                else
                {
                    symbol = ch;
                }
                newline = false;

            }


        }

        String normalizedLanguageString = normalizedLanguage.toString().trim().replaceAll(" +", " ");

        PrintWriter out = new PrintWriter("data_after.txt");

        out.println(normalizedLanguageString);
        out.close();

        buffer.close();
        reader.close();
        in.close();

    }

Thanks a lot in advance ;)

1 Answer:

Answer 0 (score: 0)

Using a different Charset encoding solved the problem :)

Change this line:

Charset encoding = Charset.defaultCharset();

To:

Charset encoding = Charset.forName("UTF8");
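
For what it's worth, since Java 7 the same charset is also available as the constant `StandardCharsets.UTF_8`, which avoids the string lookup. Note too that `new PrintWriter("data_after.txt")` in the question still writes with the platform default charset, so it may be worth passing the charset on the output side as well. Below is a minimal sketch of that idea, assuming the input file really is UTF-8. The class name `Utf8Copy` and the helper `lowerCaseCopy` are made up for illustration, and the helper only lower-cases characters instead of running the full normalization from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {

    // Hypothetical helper: reads `source` character by character, lower-cases
    // each character, and writes the result to `target`. The same charset is
    // used for decoding the input and encoding the output.
    public static void lowerCaseCopy(Path source, Path target, Charset charset) throws IOException {
        // try-with-resources closes both streams even if an exception is thrown
        try (BufferedReader in = Files.newBufferedReader(source, charset);
             BufferedWriter out = Files.newBufferedWriter(target, charset)) {
            int r;
            while ((r = in.read()) != -1) {
                out.write(Character.toLowerCase((char) r));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the question; UTF-8 is an assumption about the data
        lowerCaseCopy(Paths.get("data.txt"), Paths.get("data_after.txt"), StandardCharsets.UTF_8);
    }
}
```

One difference from `InputStreamReader`: the reader returned by `Files.newBufferedReader` throws an exception on malformed input instead of silently substituting replacement characters, so a file that is not actually UTF-8 fails loudly.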

Thank you very much
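
As a footnote on why the output looked the way it did: the garbled text is consistent with UTF-8 bytes being decoded with a single-byte charset. The sketch below reproduces that, assuming the platform default was something like ISO-8859-1 or windows-1252 (the question does not say which). The class `MojibakeDemo` and the method `misread` are hypothetical names for illustration. Each accented letter takes two bytes in UTF-8, and a single-byte charset turns those two bytes into two separate characters, so "médica" comes out as "mÃ©dica", which after lower-casing is the "mã©dica" seen in the question's output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MojibakeDemo {

    // Encodes `text` as UTF-8 and decodes the resulting bytes as ISO-8859-1,
    // imitating a UTF-8 file being read with a single-byte charset.
    public static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escape for e-acute keeps this source file ASCII-only
        String original = "m\u00e9dica";
        String garbled = misread(original).toLowerCase(Locale.ROOT);
        System.out.println(original.length()); // 6 characters
        System.out.println(garbled.length());  // 7 characters: the accent became two
    }
}
```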