Extracting text between matches using Java

Time: 2015-06-16 13:32:30

Tags: java

This is my input text:

    1. INTRODUCTION
    This is a test document. This document lines can span multiple lines.
    This is another line.
    2. PROCESS
    This is a test process. This is another line.
    3. ANOTHER HEADING
    ...

I want to extract the text between the main headings 1, 2, 3 and so on. I am using this regular expression to match the headings - ^[ ]{0,2}?[0-9]{0,2}\\.(.*)$

How do I extract the text between the matches?

Edit

I tried using this code -

    while (matcher.find()) {
    }

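If I look ahead to the start index of the next match inside this while loop, it changes the state of the matcher. How can I get the text in between using String.substring? I need to take a substring from the end of the current match to the start of the next match.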

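2 answers:

Answer 0: (score: 1)

How do I extract the text between the matches?

Do you mean between 1. INTRODUCTION and 2. PROCESS, and so on? If so, read the input line by line: if a line is not a "header" line, append its text to a buffer. If it is a header, add the buffer to a running list and then clear the buffer.

Something like this (pseudocode):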
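    content = empty list of strings
    currentContent = ""
    while line = readNextLine()
        if line is not a header
            currentContent += line
        else
            // found a new header: save the collected text, then clear the buffer
            if currentContent != ""
                content.add(currentContent)
                currentContent = ""
    // after the loop, save whatever followed the last header
    if currentContent != ""
        content.add(currentContent)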

Edit: with the input as one big string

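    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;

    // Split the input into lines
    String[] bits = yourString.split("\\n");

    StringBuilder currentContent = new StringBuilder();    // Text between headers
    List<String> content = new ArrayList<String>();        // Running list of text between headers

    // Loop through each line
    for (String bit : bits) {
        // yourPattern is the compiled header Pattern; matches() tests the whole line
        Matcher m = yourPattern.matcher(bit);
        if (m.matches()) {
            // Found a header: store the text collected so far, then reset the buffer
            if (currentContent.length() != 0) {
                content.add(currentContent.toString());
                currentContent.setLength(0);
            }
        } else {
            // Not a header: append the line, restoring the newline removed by split
            currentContent.append(bit).append('\n');
        }
    }
    // Store the text that follows the last header
    if (currentContent.length() != 0) {
        content.add(currentContent.toString());
    }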

Something like this should work. I suppose you could do it with a complex multi-line regex instead, but this seems easier to me.
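For completeness, the regex-only route does not have to be complicated. One option (a different technique from the line loop above, not something the asker tried) is to compile the heading pattern with Pattern.MULTILINE and let Pattern.split cut the text at every heading. The sketch below assumes the whole input is already in a single String; the class name SplitOnHeadings and the method sections are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitOnHeadings {
    // Heading pattern from the question. MULTILINE makes ^ and $ match at
    // every line boundary instead of only at the ends of the input.
    static final Pattern HEADING =
            Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);

    // Returns the non-empty, trimmed chunks of text found between heading lines.
    static List<String> sections(String text) {
        List<String> result = new ArrayList<>();
        for (String chunk : HEADING.split(text)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = " 1. INTRODUCTION\n"
                + " This is a test document.\n"
                + " This is another line.\n"
                + " 2. PROCESS\n"
                + " This is a test process.\n"
                + " 3. ANOTHER HEADING\n";
        for (String section : sections(text)) {
            System.out.println("[" + section + "]");
        }
    }
}
```

The trade-off is that split throws the heading text away, so this only fits if you need the bodies and not the titles. Any text before the first heading also comes back as a chunk.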

Answer 1: (score: 0)

How about this:

    String text =
        " 1. INTRODUCTION\n"
        + " This is a test document. This document lines can span multiple lines.\n"
        + " This is another line.\n"
        + " 2. PROCESS\n"
        + " This is a test process. This is another line.\n"
        + " 3. ANOTHER HEADING\n";
    Pattern pat = Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);
    Matcher m = pat.matcher(text);
    int start = 0;
    while (m.find()) {
        if (start < m.start()) {
            System.out.println("*** paragraphs:");
            System.out.println(text.substring(start, m.start()));
        }
        System.out.println("*** title:");
        System.out.println(m.group());
        start = m.end();
    }

The result is:

    *** title:
     1. INTRODUCTION
    *** paragraphs:

     This is a test document. This document lines can span multiple lines.
     This is another line.

    *** title:
     2. PROCESS
    *** paragraphs:

     This is a test process. This is another line.

    *** title:
     3. ANOTHER HEADING

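You will probably want to strip the line breaks before and after each paragraph. Note also that any text following the last heading is not printed by this loop.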
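If you want the pieces as data rather than printed output, the same loop can collect them. The sketch below is one possible way to do it, not the only one: it remembers where the previous heading ended, so no lookahead into the matcher is needed, trims each body, and also keeps the text after the last heading. The class name SectionsByHeading and the method parse are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionsByHeading {
    // Same heading pattern as above, matched line by line via MULTILINE.
    static final Pattern HEADING =
            Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);

    // Maps each trimmed heading line to the trimmed text that follows it.
    static Map<String, String> parse(String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(text);
        String currentTitle = null; // heading whose body is being collected
        int bodyStart = 0;          // index just past the previous heading
        while (m.find()) {
            if (currentTitle != null) {
                // The body runs from the end of the previous heading to the start of this one
                sections.put(currentTitle, text.substring(bodyStart, m.start()).trim());
            }
            currentTitle = m.group().trim();
            bodyStart = m.end();
        }
        if (currentTitle != null) {
            // Text after the last heading runs to the end of the input
            sections.put(currentTitle, text.substring(bodyStart).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = " 1. INTRODUCTION\n"
                + " This is a test document.\n"
                + " This is another line.\n"
                + " 2. PROCESS\n"
                + " This is a test process.\n";
        parse(text).forEach((title, body) ->
                System.out.println(title + " -> [" + body + "]"));
    }
}
```

Two things to be aware of in this sketch: text that appears before the first heading is dropped, and two headings with identical text would overwrite each other in the map. A list of title/body pairs avoids the second problem if that matters for your input.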