I am trying to extract a given pattern from a text file, but the results are not 100% what I want.
Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ParseText1 {
    public static void main(String[] args) {
        String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
                + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
                + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
                + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
                + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";
        Pattern p = Pattern
                .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/]*>",
                        Pattern.MULTILINE);
        Matcher m = p.matcher(content);
        // print all the matches that we find
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
The output I get is:
<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe> <Fred Kej>
<2004-08-24> bar<Bob Joe><Fred Kej>
<2004-08-21><2004-08-21> baz <John Doe> and now <code>
The output I want is:
<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe>
<2004-08-24> bar<Bob Joe>
<2004-08-21> baz <John Doe>
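In short, I need to extract each sequence of "date", "text (or blank)" and "name", in that order. Everything else should be skipped. For example, the tag "Fred Kej" has no "date" tag directly before it, so it should be treated as invalid.
Also, as a side question: is there a way to store or keep track of the text fragments that were skipped or rejected, as well as the valid ones?
Thanks, Brian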
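Answer 0 (score: 3)
This pattern works: "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>"
Using [^<]* instead of .*? keeps the gap between the two tags from running over another tag, and excluding > from the second character class stops the name tag at its first closing bracket.
As for capturing the non-matching strings, I think it is easier to take the Matcher.start() and end() indexes and extract substrings from the original text than to extend the pattern, which is already quite complicated: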
String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
        + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
        + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
        + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
        + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";
Pattern p = Pattern.compile(
        "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>",
        Pattern.MULTILINE
);
Matcher m = p.matcher(content);
int index = 0;
while (m.find()) {
    System.out.println(content.substring(index, m.start()));
    System.out.println("**MATCH START**" + m.group() + "**MATCH END**");
    index = m.end();
}
System.out.println(content.substring(index));
This prints:
<p>Yada yada yada <code> foo ddd</code>yada yada ...
more here
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
etc etc
more here again
**MATCH START**<2004-09-24> bar<Bob Joe>**MATCH END**
<Fred Kej> etc etc
more here again
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
<Fred Kej> etc etc
and still more <2004-08-21>
**MATCH START**<2004-08-21> baz <John Doe>**MATCH END**
and now <code>the end</code> </p>
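Since the question asks about storing the rejected fragments rather than only printing them, the same start()/end() bookkeeping can fill two lists. This is a minimal sketch under assumptions: the class name FragmentCollector, the Result holder and the split method are hypothetical names made up for illustration, and the pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragmentCollector {
    // Same pattern as in the answer above: date tag, text without '<', name tag.
    static final Pattern ENTRY =
            Pattern.compile("<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>");

    // Hypothetical holder for the two result lists.
    static final class Result {
        final List<String> matches = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result split(String content) {
        Result result = new Result();
        Matcher m = ENTRY.matcher(content);
        int index = 0;
        while (m.find()) {
            // Text between the previous match and this one was rejected.
            if (m.start() > index) {
                result.skipped.add(content.substring(index, m.start()));
            }
            result.matches.add(m.group());
            index = m.end();
        }
        // Whatever follows the last match is rejected as well.
        if (index < content.length()) {
            result.skipped.add(content.substring(index));
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = split("a <2004-08-24> bar<Bob Joe> b <Fred Kej>");
        System.out.println("matches: " + r.matches);
        System.out.println("skipped: " + r.skipped);
    }
}
```

Keeping the two lists separate loses the interleaving order; if the order matters, a single list of tagged entries would be one alternative.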
Answer 1 (score: 0)
Have you tried adding the > character to the list of disallowed characters in the second set of brackets?
Pattern p = Pattern
        .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/>]*>",
                Pattern.MULTILINE);
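To see what this change does on the doubled date tag in the last line of the sample, the sketch below runs the pattern with the lazy .*? gap next to a variant whose gap is [^<]*. The class GapComparison and the helper findAll are hypothetical names used only for this illustration; [0-9] is written in place of [1234567890], which is equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GapComparison {
    // Returns every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Last line of the sample text from the question.
        String line = "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>";
        String date = "<[0-9]{4}-[0-9]{2}-[0-9]{2}>";

        // Lazy gap: '.' can still consume '<', so the gap swallows the second date tag.
        System.out.println(findAll(date + ".*?<[^%0-9/>]*>", line));
        // prints [<2004-08-21><2004-08-21> baz <John Doe>]

        // Gap without '<': the match has to start at the date tag nearest the name.
        System.out.println(findAll(date + "[^<]*<[^%0-9/>]*>", line));
        // prints [<2004-08-21> baz <John Doe>]
    }
}
```

On this input, excluding > fixes the over-long name tag, but the lazy gap alone still lets the match begin at the first of the two date tags.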
Answer 2 (score: 0)
Use this regex instead. I also added code to echo the discarded text fragments.
Pattern p = Pattern.compile(
        "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)" + // <2004-08-21>
        "([^<]*)" +                        // baz
        "(<[^%0-9>]*>)",                   // <John Doe>
        Pattern.MULTILINE);
Matcher m = p.matcher(content);
// print all the matches that we find
int start = 0;
while (m.find()) {
    // discarded text between the previous match and this one, indented
    System.out.println("\t"
            + content.substring(start, m.start()).replaceAll("\n", "\n\t"));
    System.out.println(m.group());
    start = m.end();
}
// discarded text after the last match
System.out.println("\t"
        + content.substring(start).replaceAll("\n", "\n\t"));
Output:
    <p>Yada yada yada <code> foo ddd</code>yada yada ...
    more here
<2004-08-24> bar<Bob Joe>
    etc etc
    more here again
<2004-09-24> bar<Bob Joe>
    <Fred Kej> etc etc
    more here again
<2004-08-24> bar<Bob Joe>
    <Fred Kej> etc etc
    and still more <2004-08-21>
<2004-08-21> baz <John Doe>
    and now <code>the end</code> </p>
The indented lines are the discarded fragments.
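The pattern in this answer wraps the date, the text and the name in three capturing groups, so each part can also be read on its own with Matcher.group(int). A minimal sketch, assuming a hypothetical class GroupExtraction and helper firstEntry that are not part of the original answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupExtraction {
    static final Pattern ENTRY = Pattern.compile(
            "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)"  // group 1: date tag
            + "([^<]*)"                        // group 2: text between the tags
            + "(<[^%0-9>]*>)");                // group 3: name tag

    // Formats the first entry found in the input, or returns null if there is none.
    static String firstEntry(String input) {
        Matcher m = ENTRY.matcher(input);
        if (!m.find()) {
            return null;
        }
        return "date=" + m.group(1)
                + " text=" + m.group(2).trim()
                + " name=" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(firstEntry(
                "and still more <2004-08-21><2004-08-21> baz <John Doe> and now"));
        // prints date=<2004-08-21> text=baz name=<John Doe>
    }
}
```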