How do I remove all line breaks and paragraph breaks from a given text file using Java?

Asked: 2017-10-15 23:16:27

Tags: java regex file

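I have a huge text file. I want to remove all line breaks, and I also want the paragraph breaks removed so that each paragraph is appended to the previous one. How should I do this in Java? I have used replaceAll(), but I am still stuck on appending a paragraph to the previous one.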


    // word (a Text) and one (an IntWritable) are fields of the enclosing Mapper class, not shown here.
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        // Replace runs of tabs/newlines with a space so adjacent words are not glued together.
        String line = value.toString().replaceAll("[\\t\\n]+", " ");
        System.out.println(line);
        StringTokenizer itr = new StringTokenizer(line);
        String[] tokens = new String[itr.countTokens()];

        for (int l = 0; l < tokens.length; l++) {
            tokens[l] = itr.nextToken();
        }

        // Emit each token followed by up to four of the tokens after it.
        for (int i = 0; i < tokens.length; i++) {
            sb.append(tokens[i]);
            // Stop at the end of the array to avoid an ArrayIndexOutOfBoundsException.
            for (int j = i + 1; j < i + 5 && j < tokens.length; j++) {
                sb.append(" ");
                sb.append(tokens[j]);
            }
            word.set(sb.toString());
            context.write(word, one);
            //System.out.println(sb.toString());
            sb.setLength(0);
        }
    }

Input:

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare
sn
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
**     Please follow the copyright guidelines in this file.     **

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***




Produced by World Library, Inc., from their Library of the Future

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!

Shakespeare

*This Etext has certain copyright implications you should read!*

Expected output:

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare sn This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below Please follow the copyright guidelines in this file.Title: The Complete Works of William Shakespeare Author: William Shakespeare Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994 Language: English START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE Produced by World Library, Inc., from their Library of the Future This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS.  Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!! Shakespeare *This Etext has certain copyright implications you should read!*

2 answers:

Answer 0 (score: 0)

If you only want the words, you can use \w to search for the words and join them.

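import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordsOnly {
    public static void main(String[] args) {
        final String input = "hello, how are you today how was school today, what did you have for food? this star needs to be removed ****";
        // \w+ matches runs of letters, digits and underscores; punctuation is skipped.
        final String regex = "\\w+";
        final Matcher m = Pattern.compile(regex).matcher(input);

        // StringBuilder avoids repeated string concatenation inside the loop.
        final StringBuilder output = new StringBuilder();
        while (m.find()) {
            output.append(m.group(0)).append(" ");
        }
        System.out.println(output);
    }
}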

Result:

hello how are you today how was school today what did you have for food this star needs to be removed 
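
Note that \w+ also drops punctuation (the commas and the question mark above), whereas the expected output in the question keeps commas, colons and brackets. If the punctuation should survive, one option is to collapse whitespace runs instead of extracting words. The following is only a minimal sketch under that assumption; the names CollapseWhitespace and collapse are made up for illustration and do not come from the question or the answers.

```java
class CollapseWhitespace {
    // Hypothetical helper: replaces every run of whitespace
    // (spaces, tabs, \r, \n) with a single space and trims the ends.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "Title: The Complete Works\n\nAuthor: William Shakespeare\r\n\r\nLanguage:\tEnglish\n";
        // Line breaks and paragraph breaks both become a single space.
        System.out.println(collapse(input));
    }
}
```

This removes line breaks and paragraph breaks alike but leaves characters such as the asterisks in place; those would need a separate replacement.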

Answer 1 (score: 0)

Use string-literal escapes for real tab and line-break characters. Don't forget the carriage return (on Windows).

String text = value.toString()
    .replaceAll("(\r?\n){2}", "§") // Two line breaks will become a real line break.
    .replaceAll("[\t\r\n]+", " ") // White space will become a real space.
    .replace("§", "\n"); // The real line breaks.

Instead of §, some obscure character such as \uFEFF could be used.

This turns

Good Morning,

How are you?
I am fine.

into

Good Morning,
How are you? I am fine.
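
The same result can be reached without a placeholder character by splitting on blank lines first and joining afterwards. This is a rough sketch rather than the answerer's method: JoinParagraphLines and joinLines are hypothetical names, and unlike the snippet above it treats three or more consecutive line breaks as a single paragraph break.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class JoinParagraphLines {
    // Hypothetical helper: produces one output line per paragraph.
    static String joinLines(String text) {
        // Split on paragraph breaks (two or more consecutive line breaks) ...
        return Arrays.stream(text.split("(\\r?\\n){2,}"))
                // ... then turn the remaining tabs and line breaks in each paragraph into spaces.
                .map(p -> p.replaceAll("[\\t\\r\\n]+", " ").trim())
                // Skip empty pieces, e.g. when the text starts with blank lines.
                .filter(p -> !p.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String input = "Good Morning,\n\nHow are you?\nI am fine.";
        System.out.println(joinLines(input));
    }
}
```

To append every paragraph to the previous one, as the question asks, the joining separator could be changed from "\n" to " ".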