Question

I want to capture the whole block from a start header up to, but not including, the end header. For example:

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section2>

The match should be:

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File

How do I write a regex pattern for this match in Java?

Answer 1

If your entire input is in this format, you can simply split it:

String[] sections = input.split("\\R(?=<)");

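\R matches "any linebreak sequence" (available since Java 8), and (?=<) means "the next character is a '<'". Because the lookahead consumes nothing, each '<' stays at the start of its section.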
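As a minimal sketch of the split approach, the class below wraps the call in a helper and prints each piece. The class and method names (SplitSections, splitSections) are made up for illustration, and the sketch assumes that '<' appears at the start of a line only in section headers.

```java
class SplitSections {
    // Splits at every line break that is immediately followed by '<'.
    // Assumption: '<' starts a line only where a new section header begins.
    static String[] splitSections(String input) {
        return input.split("\\R(?=<)");
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\nDescription=Revaluation\n"
                + "Grouping_File\n<section2>\nfoo";
        for (String section : splitSections(input)) {
            // Brackets make the section boundaries visible in the output.
            System.out.println("[" + section + "]");
        }
    }
}
```

The line break that separates two sections is consumed by the split, so no piece ends with a trailing line break.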

If that is not the case, however, you need to reach into the regex toolbox:

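The DOTALL flag, so the dot also matches line breaks
The MULTILINE flag, so ^ also matches at the start of each line
A negative lookahead, so you stop consuming at the start of the next section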

Assuming a "section" starts with "<" at the beginning of a line:

"(?sm)^<\\w+>(.(?!^<))*"

Here is how to use it:

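import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "<section1>\nBase_Currency=EUR\nDescription=Revaluation\nGrouping_File\n<section2>\nfoo";
Matcher matcher = Pattern.compile("(?sm)^<\\w+>(.(?!^<))*").matcher(input);
while (matcher.find()) {
    // Each match is one section: header plus body, without the line break before the next header
    String section = matcher.group();
}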
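For completeness, here is a self-contained sketch that collects every match into a list. It uses the same pattern, but passes Pattern.DOTALL and Pattern.MULTILINE as compile flags in place of the inline (?sm) prefix, which is equivalent. The names SectionFinder and findSections are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionFinder {
    // Same expression as in the answer; the two flags replace the inline (?sm).
    private static final Pattern SECTION =
            Pattern.compile("^<\\w+>(.(?!^<))*", Pattern.DOTALL | Pattern.MULTILINE);

    // Returns every section (header plus body) found in the input, in order.
    static List<String> findSections(String input) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(input);
        while (matcher.find()) {
            sections.add(matcher.group());
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\nDescription=Revaluation\n"
                + "Grouping_File\n<section2>\nfoo";
        for (String section : findSections(input)) {
            System.out.println("[" + section + "]");
        }
    }
}
```

The negative lookahead refuses to consume the line break that sits directly before the next header, so each match ends at the last character of the section's final line.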

Answer 2

If your input looks like this

<section1>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section2>
Base_Currency=EUR
Description=Revaluation
Grouping_File
<section3>
Base_Currency=EUR
Description=Revaluation
Grouping_File

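then you can use the following regex (shown as a raw pattern; double each backslash when writing it as a Java string literal)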

(?s)(<section\d+>.*?)(?=<section\d+>|$)

The explanation of the regex is

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
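
The pattern explained above could be applied from Java roughly as follows. This is only a sketch: the class and method names (NumberedSections, extract) are invented for the example, and it assumes every header has the form <sectionN>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedSections {
    // Lazy match from one <sectionN> header up to the next header or the end of input.
    private static final Pattern SECTION =
            Pattern.compile("(?s)(<section\\d+>.*?)(?=<section\\d+>|$)");

    // Returns the text captured by group 1 for every section in the input.
    static List<String> extract(String input) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(input);
        while (matcher.find()) {
            sections.add(matcher.group(1));
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\n<section2>\nBase_Currency=EUR\n"
                + "<section3>\nBase_Currency=EUR";
        for (String section : extract(input)) {
            // trim() drops the line break that precedes the next header.
            System.out.println("[" + section.trim() + "]");
        }
    }
}
```

Unlike the pattern in Answer 1, a section captured here keeps the line break that precedes the next header, so trimming may be needed before further processing.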

If you only want to match a single section, and the section body never contains a '<' character, you can use

(?s)(<section\d+>[^<]*)

The explanation of this regex is

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    <section                 '<section'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
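
A short sketch of how this second pattern might be used to fetch only the first section. FirstSection and first are hypothetical names, and the approach assumes the section body contains no '<' character, since the character class stops at the first one. The (?s) flag is kept as in the answer, although it has no effect here because the pattern contains no dot.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstSection {
    // [^<]* already matches line breaks, so the match runs up to the next '<'.
    private static final Pattern SECTION = Pattern.compile("(?s)(<section\\d+>[^<]*)");

    // Returns the first section in the input, or an empty Optional if there is none.
    static Optional<String> first(String input) {
        Matcher matcher = SECTION.matcher(input);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "<section1>\nBase_Currency=EUR\nGrouping_File\n<section2>\nfoo";
        System.out.println(first(input).map(String::trim).orElse("(no section found)"));
    }
}
```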

How to match a block from start to end using a regular expression