How to ... from Java

Time: 2016-08-28 23:21:58

Tags: java regex

I received a file that contains many paragraphs. I want to read one paragraph at a time and perform an operation on it.

final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";

String currentLine;

final BufferedReader bf = new BufferedReader(new FileReader("filename"));

currentLine = bf.readLine();

final StringBuilder stringBuilder = new StringBuilder();
while (currentLine != null) {
    stringBuilder.append(currentLine);
    stringBuilder.append(System.lineSeparator());
    currentLine = bf.readLine();
}

String[] paragraph = new String[stringBuilder.length()];

if (stringBuilder != null) {
    final String value = stringBuilder.toString();
    paragraph = value.split(PARAGRAPH_SPLIT_REGEX);
}

for (final String s : paragraph) {
    System.out.println(s);
}

The file (each paragraph is preceded by two space characters, and there are no blank lines between paragraphs):

  

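                      Story

  Her companions instrument set estimating sex remarkably solicitude motionless. Property men the why smallest graceful day insisted required. Inquiry justice country old placing sitting any ten age. Looking venture justice in evident in totally he do ability. Be is lose girl long of up give.
  "Trifling wondered unpacked ye at he. In household certainty an on tolerably smallness difficult. Many no each like up be is next neat. Put not enjoyment behaviour her supposing. At he pulled object others."
  Passage its ten led hearted removal cordial. Preference any astonished unreserved mrs. Prosperous understood middletons in conviction an uncommonly do. Supposing so be resolving breakfast am or perfectly. Is drew am hill from mr. Valley by oh twenty direct me so.
  Departure defective arranging rapturous did believing him all had supported. Family months lasted simple set nature vulgar him.   "Picture for attempt joy excited ten carried manners talking how. Suspicion neglected he resolving agreement perceived at an."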

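However, I am not getting the expected output. The paragraph variable contains only two values:

  1. The file title
  2. The rest of the file's contents.

I suspect the regular expression I am using here does not work. I took it from this question: Splitting text into paragraphs with regex JAVA

I am using Java 8.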

3 Answers:

Answer 0 (score: 2)

You can use a Scanner with a delimiter to iterate over the text. For example:

Scanner scanner = new Scanner(text).useDelimiter("\n  ");
while (scanner.hasNext()) {
    String paragraph = scanner.next();
    System.out.println("# " + paragraph);
}

The output is:

#                       Story

# Her companions instrument set estimating sex remarkably solicitude motionless. Property men the why smallest graceful day insisted required. Inquiry justice country old placing sitting any ten age. Looking venture justice in evident in totally he do ability. Be is lose girl long of up give.
# "Trifling wondered unpacked ye at he. In household certainty an on tolerably smallness difficult. Many no each like up be is next neat. Put not enjoyment behaviour her supposing. At he pulled object others."
# Passage its ten led hearted removal cordial. Preference any astonished unreserved mrs. Prosperous understood middletons in conviction an uncommonly do. Supposing so be resolving breakfast am or perfectly. Is drew am hill from mr. Valley by oh twenty direct me so.
# Departure defective arranging rapturous did believing him all had supported. Family months lasted simple set nature vulgar him.   "Picture for attempt joy excited ten carried manners talking how. Suspicion neglected he resolving agreement perceived at an."
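
A self-contained variant of the Scanner approach might look like the sketch below. The class name `ScannerParagraphs`, the method `readParagraphs`, and the sample text are made up for illustration. It also assumes a slightly different delimiter, `\R {2}` (any line break followed by two spaces), so that files with `\r\n` line endings do not leave a stray `\r` at the end of a paragraph, and it trims each paragraph, which the snippet above does not do.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerParagraphs {

    // Assumed delimiter: any line break (\R, supported since Java 8)
    // followed by two spaces, i.e. the start of an indented paragraph.
    private static final String PARAGRAPH_DELIMITER = "\\R {2}";

    // Reads every paragraph from the source and returns them trimmed.
    static List<String> readParagraphs(Readable source) {
        List<String> paragraphs = new ArrayList<>();
        try (Scanner scanner = new Scanner(source).useDelimiter(PARAGRAPH_DELIMITER)) {
            while (scanner.hasNext()) {
                // trim() drops the title's indentation and trailing line breaks
                paragraphs.add(scanner.next().trim());
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Hypothetical sample text in the same layout as the question's file
        String text = "      Story\n\n"
                + "  First paragraph\n"
                + "continued on the next line\n"
                + "  Second paragraph\n";
        for (String paragraph : readParagraphs(new StringReader(text))) {
            System.out.println("# " + paragraph);
        }
    }
}
```

Because `Scanner` accepts any `Readable`, the same method should also work with a `FileReader` for the real file, which avoids building the whole text in a `StringBuilder` first.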

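Answer 1 (score: 1)

Following Jason's comment, I tried his approach. I think I get the desired result, but I am not happy with this approach: the time and space complexity have increased, and I may improve on it later.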

currentLine = bf.readLine();

// Each paragraph is stored as the list of its lines.
List<List<String>> paragraphs = new LinkedList<>();

// Index of the paragraph that is currently being filled.
int counter = 0;
while (currentLine != null) {

    // The very first line (the title) always opens the first paragraph.
    if (paragraphs.isEmpty()) {
        List<String> paragraph = new LinkedList<>();
        paragraph.add(currentLine);
        paragraph.add(System.lineSeparator());
        paragraphs.add(paragraph);

        currentLine = bf.readLine();
        continue;
    }

    if (currentLine.startsWith(" ")) {
        // An indented line starts a new paragraph.
        List<String> paragraph = new LinkedList<>();
        paragraph.add(currentLine);
        counter = counter + 1;
        paragraphs.add(paragraph);
    } else {
        // Any other line continues the current paragraph.
        List<String> continuedParagraph = paragraphs.get(counter);
        continuedParagraph.add(currentLine);
    }

    currentLine = bf.readLine();
}

for (final List<String> story : paragraphs) {
    for (final String s : story) {
        System.out.println(s);
    }
}
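
The same line-by-line idea can probably be written more compactly. The sketch below is one possible variant, not the code from Jason's comment: the class `ParagraphGrouper` and the method `groupParagraphs` are hypothetical names, and it keeps one `StringBuilder` per paragraph instead of a list of lines and a counter. As in the code above, a line that starts with a space opens a new paragraph and every other line is appended to the current one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphGrouper {

    // Groups lines into paragraphs: the first line and every line that
    // starts with a space open a new paragraph; other lines continue it.
    static List<String> groupParagraphs(Iterable<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current == null || line.startsWith(" ")) {
                if (current != null) {
                    paragraphs.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append(System.lineSeparator()).append(line);
            }
        }
        if (current != null) {
            paragraphs.add(current.toString()); // flush the last paragraph
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; a real program would read them from the file
        List<String> lines = Arrays.asList(
                "  First paragraph", "continued", "  Second paragraph");
        for (String paragraph : groupParagraphs(lines)) {
            System.out.println("# " + paragraph);
        }
    }
}
```

For a real file the lines could come from `java.nio.file.Files.readAllLines(path)`. Each line is visited once, so the work grows linearly with the file size.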

Answer 2 (score: 0)

You can find every indented paragraph with a global match and add each match to a list.

"(?m)^[^\\S\\r\\n]{2,}\\S.*(?:\\r?\\n|$)(?:^\\S.*(?:\\r?\\n|$))*"

Expanded:

 (?m)                     # Multi-line mode ( ^ = begin of line )

 ^ [^\S\r\n]{2,}          # Begin of Paragraph, 2 or more horizontal wsp at BOL
 \S .*                    # Rest of line, must be non-wsp as first letter.
 (?: \r? \n | $ )

 (?:                      # Optional, many more lines of this paragraph
      ^ \S .* 
      (?: \r? \n | $ )
 )*
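
As a rough sketch of how this pattern could be applied in Java, the example below collects every match with `Matcher.find()`. The class `ParagraphFinder`, the method `findParagraphs`, and the sample text are assumptions made for illustration; only the regex itself comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphFinder {

    // The pattern from this answer: a line indented by two or more
    // horizontal whitespace characters, followed by any number of
    // non-indented continuation lines.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "(?m)^[^\\S\\r\\n]{2,}\\S.*(?:\\r?\\n|$)(?:^\\S.*(?:\\r?\\n|$))*");

    // Returns every paragraph found in the text, in order of appearance.
    static List<String> findParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(text);
        while (matcher.find()) {
            paragraphs.add(matcher.group());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Hypothetical sample text in the same layout as the question's file
        String text = "      Story\n\n"
                + "  First paragraph, line one\n"
                + "continued on a second line\n"
                + "  Second paragraph\n";
        for (String paragraph : findParagraphs(text)) {
            System.out.println("# " + paragraph.trim());
        }
    }
}
```

Note that the indented title line also satisfies the pattern, so it comes back as the first element of the list, and each match keeps its indentation and trailing line break unless it is trimmed.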