Question

I'm trying to extract a given pattern from a text file, but the results are not 100% what I want.

Here is my code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseText1 {

public static void main(String[] args) {

    String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
        + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
        + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
        + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
        + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";

    Pattern p = Pattern
    .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/]*>",
            Pattern.MULTILINE);

    Matcher m = p.matcher(content);

    // print all the matches that we find
    while (m.find()) {

        System.out.println(m.group());

    }

}
}

The output I get is:

<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe> <Fred Kej>
<2004-08-24> bar<Bob Joe><Fred Kej>
<2004-08-21><2004-08-21> baz <John Doe> and now <code>

The output I want is:

<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe>
<2004-08-24> bar<Bob Joe>
<2004-08-21> baz <John Doe>

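In short, the sequence "date", "text (or whitespace)", "name" must be extracted. Everything else should be avoided. For example, the tag "Fred Kej" has no "date" tag before it, so it should be flagged as invalid.

Also, as a side question: is there a way to store or keep track of the skipped/rejected text fragments as well as the valid text?

Thanks, Brian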

Answer 1

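This pattern works: "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>"

As for capturing the non-matching strings, I think it is easier to use the Matcher.start() and end() indices and extract substrings from the original text than to do it in the pattern, which is already quite complex.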

String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
    + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
    + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
    + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
    + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";

Pattern p = Pattern.compile(
    "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>",
    Pattern.MULTILINE
);

Matcher m = p.matcher(content);
int index = 0;
while (m.find()) {
    System.out.println(content.substring(index, m.start()));
    System.out.println("**MATCH START**" + m.group() + "**MATCH END**");
    index = m.end();
}
System.out.println(content.substring(index));

Prints:

<p>Yada yada yada <code> foo ddd</code>yada yada ...
more here 
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
 etc etc
more here again 
**MATCH START**<2004-09-24> bar<Bob Joe>**MATCH END**
 <Fred Kej> etc etc
more here again 
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
<Fred Kej> etc etc
and still more <2004-08-21>
**MATCH START**<2004-08-21> baz <John Doe>**MATCH END**
 and now <code>the end</code> </p>
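
Since the question asks about storing the rejected fragments rather than just printing them, here is a minimal sketch of the same start()/end() bookkeeping that collects both kinds of text into lists. The class name FragmentCollector and the method names accepted and rejected are made up for this example, and skipping empty fragments is a design choice of the sketch, not something the question requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class FragmentCollector {

    // Same pattern as above. Pattern.MULTILINE is left out because it only
    // changes how ^ and $ behave, and this pattern uses neither.
    static final Pattern DATE_TEXT_NAME =
            Pattern.compile("<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>");

    // Returns every date/text/name match, in order of appearance.
    static List<String> accepted(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = DATE_TEXT_NAME.matcher(content);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Returns the text between matches, in order, skipping empty pieces.
    static List<String> rejected(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = DATE_TEXT_NAME.matcher(content);
        int index = 0;
        while (m.find()) {
            if (m.start() > index) {
                result.add(content.substring(index, m.start()));
            }
            index = m.end();
        }
        // Whatever follows the last match is rejected text too.
        if (index < content.length()) {
            result.add(content.substring(index));
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
                + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now\n";
        System.out.println("accepted: " + accepted(content));
        System.out.println("rejected: " + rejected(content));
    }
}
```

Running the matcher twice keeps the two methods independent; a single pass that fills both lists would work just as well.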

Answer 2

Have you tried adding the > character to the list of disallowed characters in the second set of brackets?

Pattern p = Pattern
    .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/>]*>",
            Pattern.MULTILINE);
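
As a rough check of what this change does and does not cover, the sketch below runs the pattern on the last input line, where two dates appear back to back. It compares the pattern above with a variant that also replaces ".*?" with "[^<]*", so the text between date and name cannot run across another tag. The class name LazyVersusNegated and the helper firstMatch are invented for this example, and the variant is an assumption of the sketch rather than part of the suggestion above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named only for this illustration.
class LazyVersusNegated {

    // The pattern suggested above: '>' excluded from the name, ".*?" in the middle.
    static final Pattern LAZY = Pattern.compile(
            "<[0-9]{4}-[0-9]{2}-[0-9]{2}>.*?<[^%0-9/>]*>");

    // Variant for comparison (an assumption of this sketch): "[^<]*" in the middle.
    static final Pattern NEGATED = Pattern.compile(
            "<[0-9]{4}-[0-9]{2}-[0-9]{2}>[^<]*<[^%0-9/>]*>");

    // Returns the first match of p in line, or null if there is none.
    static String firstMatch(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "and still more <2004-08-21><2004-08-21> baz <John Doe>"
                + " and now <code>the end</code> </p>";
        // prints <2004-08-21><2004-08-21> baz <John Doe>
        System.out.println(firstMatch(LAZY, line));
        // prints <2004-08-21> baz <John Doe>
        System.out.println(firstMatch(NEGATED, line));
    }
}
```

The lazy quantifier still starts at the first date and stretches over the second one, which is why the doubled date shows up in the first result.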

Answer 3

Use this regular expression instead. I have also added code to echo the discarded text fragments.

    Pattern p = Pattern.compile(
            "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)" + // <2004-08-21>
            "([^<]*)" +                        //  baz
            "(<[^%0-9>]*>)",                   // <John Doe>
            Pattern.MULTILINE);

    Matcher m = p.matcher(content);

    // print all the matches that we find
    int start = 0;
    while (m.find()) {
        System.out.println("\t"
                + content.substring(start, m.start()).replaceAll("\n", "\n\t"));
        System.out.println(m.group());
        start = m.end();
    }
    System.out.println("\t"
                + content.substring(start).replaceAll("\n", "\n\t"));

Output

        <p>Yada yada yada <code> foo ddd</code>yada yada ...
        more here 
<2004-08-24> bar<Bob Joe>
         etc etc
        more here again 
<2004-09-24> bar<Bob Joe>
         <Fred Kej> etc etc
        more here again 
<2004-08-24> bar<Bob Joe>
        <Fred Kej> etc etc
        and still more <2004-08-21>
<2004-08-21> baz <John Doe>
         and now <code>the end</code> </p>

The indented lines correspond to the discarded fragments.
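
The pattern above defines three capturing groups, so the date, the text in between, and the name can also be read separately with Matcher.group(int). A minimal sketch, assuming the same three-group pattern; the class name GroupDemo and the helper parts are made up for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named only for this illustration.
class GroupDemo {

    static final Pattern DATE_TEXT_NAME = Pattern.compile(
            "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)"   // group 1: the date tag
            + "([^<]*)"                        // group 2: text or whitespace
            + "(<[^%0-9>]*>)");                // group 3: the name tag

    // Returns {date, text, name} for the first match, or null if none is found.
    static String[] parts(String text) {
        Matcher m = DATE_TEXT_NAME.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("and still more <2004-08-21><2004-08-21> baz <John Doe> and now");
        System.out.println("date=" + p[0] + " text='" + p[1] + "' name=" + p[2]);
    }
}
```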

Trying to extract a pattern within a string

3 answers: