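How to match a block from start to end using a regular expression

Time: 2017-02-21 15:19:00

Tags: java regex

I want to capture the whole block from the start header up to, but not including, the end header. For example: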

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section2>

The match result should be:

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File

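The question is: how do I write a pattern for this match using regex in Java?

2 answers:

Answer 0: (score: 2)

If your entire input is in this format, you can simply split it: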

String[] sections = input.split("\\R(?=<)");

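\R means "any line break sequence" (available since Java 8), and (?=<) is a lookahead meaning "the next character is '<'". Because the lookahead consumes nothing, the '<' stays with the following section.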
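As a minimal sketch of the split approach, assuming the whole input consists of sections whose headers start with '<' at the beginning of a line (the class and method names here are hypothetical, chosen only for illustration):

```java
import java.util.Arrays;
import java.util.List;

class SplitSectionsDemo {
    // Splits at every line break that is immediately followed by '<'.
    // The lookahead (?=<) is zero-width, so '<' is kept in the next piece.
    static List<String> splitSections(String input) {
        return Arrays.asList(input.split("\\R(?=<)"));
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\nDescription=Revaluation\n"
                + "Grouping_File\n<section2>\nfoo";
        // Brackets make the boundaries of each piece visible.
        for (String section : splitSections(input)) {
            System.out.println("[" + section + "]");
        }
    }
}
```

Each returned piece is one section, and the line break in front of the next header is removed by the split itself.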

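If that is not the case, however, you need a few more tools from the regex toolbox:

  • The DOTALL flag, so that the dot also matches line breaks
  • The MULTILINE flag, so that ^ also matches at the start of each line
  • A negative lookahead, so that you stop consuming at the start of the next section

Assuming a "section" starts with "<" at the beginning of a line: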

"(?sm)^<\\w+>(.(?!^<))*"

Here is how to use it:

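import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "<section1>\nBase_Currency=EUR\nDescription=Revaluation\nGrouping_File\n<section2>\nfoo";
Matcher matcher = Pattern.compile("(?sm)^<\\w+>(.(?!^<))*").matcher(input);
while (matcher.find()) {
    // Each match is one section; it ends before the line break that precedes the next header
    String section = matcher.group();
    System.out.println(section);
}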
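The inline (?sm) prefix is equivalent to passing Pattern.DOTALL and Pattern.MULTILINE to Pattern.compile. The following sketch uses the explicit flags and collects the sections into a list; the class name, method name, and the leading "intro line" in the sample input are assumptions added for illustration, not part of the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionMatcherDemo {
    // Same pattern as above, with the flags passed explicitly instead of (?sm).
    private static final Pattern SECTION =
            Pattern.compile("^<\\w+>(.(?!^<))*", Pattern.DOTALL | Pattern.MULTILINE);

    static List<String> findSections(String input) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(input);
        while (matcher.find()) {
            // group() is the header plus everything up to, but not
            // including, the line break before the next header.
            sections.add(matcher.group());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Text before the first header is skipped, because every match
        // must begin with '<' at the start of a line.
        String input = "intro line\n<section1>\nBase_Currency=EUR\n<section2>\nfoo";
        findSections(input).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Unlike the split approach, this version ignores any text that appears before the first header.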

Answer 1: (score: 1)

If your input looks like this:

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section2>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section3>
Base_Currency=EUR
Description=Revaluation
Grouping_File

then you can use the following regular expression:

(?s)(<section\d+>.*?)(?=<section\d+>|$)

The explanation of the regular expression is:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
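
A possible way to apply this pattern from Java is sketched below. The class and method names are hypothetical, and the sample input is shortened. Note that the lazy .*? runs right up to the next header, so each captured section except the last keeps its trailing line break; callers who do not want it would need to strip it themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazySectionDemo {
    // In a Java string literal every backslash of the regex is doubled.
    private static final Pattern SECTION =
            Pattern.compile("(?s)(<section\\d+>.*?)(?=<section\\d+>|$)");

    static List<String> findSections(String input) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(input);
        while (matcher.find()) {
            // Group 1 is the header plus the body up to the next header.
            sections.add(matcher.group(1));
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "<section1>\nA=1\n<section2>\nB=2\n<section3>\nC=3";
        findSections(input).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```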

If you only want to match a single tag, you can use:

(?s)(<section\d+>[^<]*)

The explanation of this regular expression is:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
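
A short sketch of how the single-section pattern might be used, with hypothetical class and method names. Two things are worth knowing: the (?s) flag has no effect here because the pattern contains no dot, and [^<]* stops at the first '<' anywhere, so a body that itself contains '<' would be cut short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstSectionDemo {
    // [^<]* already matches line breaks, so no DOTALL flag is needed.
    private static final Pattern SECTION = Pattern.compile("<section\\d+>[^<]*");

    // Returns the first section found, or null if there is none.
    static String firstSection(String input) {
        Matcher matcher = SECTION.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\n<section2>\nfoo";
        System.out.println("[" + firstSection(input) + "]");
    }
}
```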