Given a .txt file, I need to lowercase it and remove punctuation

Time: 2014-03-13 00:54:58

Tags: java

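I'm working on a small part of a larger group project. My part takes in text files and converts them into text files that compress more easily. To do that, I lowercase every uppercase letter and replace each punctuation character with a space (i.e. " "). I'd appreciate any comments and suggestions.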

import java.io.*;
public class Formatter
{

    public static void main (String[] args) throws IOException
    {
        String nonChar = ".,:;!@#$%^&*()_-=+[]\"'<>";
        File f1 = new File("iTest.txt");
        File f2 = new File("oTest.txt");
        BufferedReader in = (new BufferedReader(new FileReader(f1)));
        PrintWriter out = (new PrintWriter(new FileWriter(f2)));

        int ch;
        while ((ch = in.read()) != -1)
        {
            if (Character.isUpperCase(ch))
            {
                ch = Character.toLowerCase(ch);
            }
            else if (in.contains(Character[ch]))//tried character
            {
                ch = ' ';
            }
            out.write(ch);
        }

        in.close();
        out.close();

    }
}

Ideally, given

Peter Piper picked a peck of pickled peppers;
A peck of pickled peppers Peter Piper picked;
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?

it would return

peter piper picked a peck of pickled peppers
a peck of pickled peppers peter piper picked
if peter piper picked a peck of pickled peppers
where s the peck of pickled peppers peter piper picked

2 answers:

Answer 0: (score: 4)

Read the file line by line as a String and process each line:

BufferedReader in = (new BufferedReader(new FileReader(f1)));
String line;
String processedLine="";
while ((line = in.readLine()) != null) {
    processedLine = line.replaceAll("[^a-zA-Z0-9]"," ").toLowerCase().replaceAll("( )+", " ");
    out.write(processedLine);
    out.write(System.getProperty("line.separator"));
}

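Note: if the text contains special characters (accented characters such as é), you can use line.replaceAll("(?U)[^\\p{Alnum}]"," ") instead, so that those letters are kept rather than replaced with spaces.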
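To make the difference between the two patterns concrete, here is a minimal sketch that puts them side by side. The class name TextNormalizer and its method names are hypothetical, chosen only for illustration. The trim() call and the use of Locale.ROOT are additions that are not in the answer above: trim() drops the space left behind by punctuation at the end of a line, and Locale.ROOT keeps lowercasing independent of the machine's default locale.

```java
import java.util.Locale;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class TextNormalizer {

    // ASCII-only variant: anything outside a-z, A-Z, 0-9 becomes a space,
    // so accented letters are treated like punctuation.
    static String normalizeAscii(String line) {
        return line.replaceAll("[^a-zA-Z0-9]", " ")
                   .toLowerCase(Locale.ROOT)
                   .replaceAll(" +", " ")   // collapse runs of spaces
                   .trim();                 // assumption: edge spaces are unwanted
    }

    // Unicode-aware variant: (?U) turns on UNICODE_CHARACTER_CLASS, so
    // \p{Alnum} also matches letters and digits outside ASCII.
    static String normalizeUnicode(String line) {
        return line.replaceAll("(?U)[^\\p{Alnum}]", " ")
                   .toLowerCase(Locale.ROOT)
                   .replaceAll(" +", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // \u00e9 is the letter e with an acute accent, escaped to avoid
        // source-file encoding issues.
        String sample = "Caf\u00e9 Ol\u00e9, Peter's peck!";
        System.out.println(normalizeAscii(sample));
        System.out.println(normalizeUnicode(sample));
    }
}
```

With the ASCII pattern the accented letters are dropped along with the punctuation; with the (?U) pattern they survive and are lowercased like any other letter.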

Answer 1: (score: 1)

You can do this in just a few lines:

String text;
BufferedReader in = (new BufferedReader(new FileReader(f1)));
text = in.readLine();
text = text.replaceAll("[^\\w\\s\\ ]", " ").toLowerCase();

This works if the text is a single line; if it spans multiple lines, you just need to run the code above in a loop.
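For the multi-line case, the loop could look roughly like the sketch below. It reuses the regex from this answer and the iTest.txt / oTest.txt file names from the question. The class name LineFormatter, the clean and format helpers, the UTF-8 charset, and Locale.ROOT are all assumptions added for illustration. Note that \w is ASCII-only by default and also keeps the underscore, so "_" is not replaced here, while accented letters are.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical class; a sketch of looping the one-line approach over a file.
final class LineFormatter {

    // Replace every character that is neither a word character nor
    // whitespace with a space, then lowercase the result.
    static String clean(String line) {
        return line.replaceAll("[^\\w\\s]", " ").toLowerCase(Locale.ROOT);
    }

    // Read the source line by line and write each cleaned line to the sink,
    // followed by the platform line separator.
    static void format(Reader source, Writer sink) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            sink.write(clean(line));
            sink.write(System.lineSeparator());
        }
        sink.flush();
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if an exception is thrown.
        // UTF-8 is an assumption; the question does not state the encoding.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("iTest.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("oTest.txt"), StandardCharsets.UTF_8)) {
            format(in, out);
        }
    }
}
```

Keeping the per-line logic in clean makes it easy to swap in a different pattern later without touching the file handling.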