How do I extract strings from paragraphs?

Time: 2013-12-07 05:58:37

Tags: java regex text-extraction

A 24-year-old youth died on the spot, after his motorcycle
 rammed a divider near Golf market on <LOCATION>BelAir</LOCATION> road 
 Thursday night. The deceased has been identified as
 John(24) hailing from <LOCATION>UK</LOCATION>.

He was originally from <LOCATION>Usa</LOCATION>.

The sentences are in 2 separate paragraphs. I want the output to look like:

Para 1:BelAir 
       UK

Para 2:Usa

I have identified the regex for the tags as:

<(?<tag>\w*)>(?<text>.*)</\k<tag>>

and for the paragraphs:

(\n|^).*?(?=\n|$)

Is there a way to combine these? Or should I use split?

2 Answers:

Answer 0: (score: 0)

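Read the text line by line and check whether each line marks a paragraph break. If you read with BufferedReader.readLine(), the line terminator is stripped, so a break shows up as a blank line rather than a String that starts with '\n':

// Assumes `reader` is a BufferedReader over the text and `pattern` is your compiled tag regex.
List<String> list = new ArrayList<>();
String line;
while ((line = reader.readLine()) != null) { // read line
    if (!line.trim().isEmpty()) {
        Matcher matcher = pattern.matcher(line); // your regex expression for tags
        while (matcher.find()) {
            list.add(matcher.group(1)); // store it in the list
        }
    } else {
        list.add(null); // blank line: add a null to the list as a paragraph marker
    }
}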

So your list will look like

BelAir
UK
null
Usa

so a new paragraph starts after each null.
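
To turn such a null-separated list back into per-paragraph groups, something along these lines might work. This is only a sketch: the class name `NullSeparatedGrouping` and the method `group` are made up for illustration, and it assumes a null entry is used purely as a paragraph marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NullSeparatedGrouping {
    // Hypothetical helper: splits a flat list into sublists,
    // starting a new sublist at every null marker.
    static List<List<String>> group(List<String> flat) {
        List<List<String>> paras = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String item : flat) {
            if (item == null) {
                paras.add(current);          // close the current paragraph
                current = new ArrayList<>(); // and start the next one
            } else {
                current.add(item);
            }
        }
        paras.add(current); // the last paragraph has no trailing null
        return paras;
    }

    public static void main(String[] args) {
        List<String> flat = Arrays.asList("BelAir", "UK", null, "Usa");
        List<List<String>> paras = group(flat);
        for (int i = 0; i < paras.size(); i++) {
            System.out.println("Para " + (i + 1) + ": " + String.join(", ", paras.get(i)));
        }
    }
}
```

Two consecutive nulls would produce an empty paragraph in the result, which may or may not be what you want.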

Answer 1: (score: 0)

Try this

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "A 24-year-old youth died on the spot, after his motorcycle " +
        "rammed a divider near Golf market on <LOCATION>BelAir</LOCATION> road" +
        " Thursday night. The deceased has been identified as  John(24) hailing from <LOCATION>UK</LOCATION>." +
        "\n He was originally from <LOCATION>Usa</LOCATION>.";
String[] paras = str.split("\n"); // divide the string into two paragraphs
Pattern pattern = Pattern.compile("<LOCATION>(.*?)</LOCATION>"); // lazy match: one tag pair at a time
for (int i = 0; i < paras.length; i++) {
    System.out.print("Para " + (i + 1) + ": ");
    Matcher matcher = pattern.matcher(paras[i]);
    while (matcher.find()) {
        System.out.println(matcher.group(1)); // text between the tags
    }
}

The output is

Para 1: BelAir
UK
Para 2: Usa
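
If you would rather keep the named groups from the question, a variation on the same split-then-match idea could look like the sketch below. It rests on several assumptions: paragraphs are separated by blank lines (as in the question's sample text, where the first paragraph itself spans several lines), the quantifier is made lazy (`.*?`) because a greedy `.*` would run from the first opening tag to the last matching closing tag in the paragraph, and the names `TagsByParagraph` and `extract` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagsByParagraph {
    // Named groups as in the question, but lazy, so each tag pair matches on its own.
    private static final Pattern TAG =
            Pattern.compile("<(?<tag>\\w+)>(?<text>.*?)</\\k<tag>>");

    // Returns, for each paragraph, the texts found between matching tags.
    static List<List<String>> extract(String input) {
        List<List<String>> result = new ArrayList<>();
        // Assumption: a paragraph break is a blank line (possibly containing spaces).
        for (String para : input.trim().split("\\n\\s*\\n")) {
            List<String> texts = new ArrayList<>();
            Matcher m = TAG.matcher(para);
            while (m.find()) {
                texts.add(m.group("text"));
            }
            result.add(texts);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "rammed a divider near Golf market on <LOCATION>BelAir</LOCATION> road\n"
                + " Thursday night. John(24) hailing from <LOCATION>UK</LOCATION>.\n"
                + "\n"
                + "He was originally from <LOCATION>Usa</LOCATION>.";
        List<List<String>> paras = extract(input);
        for (int i = 0; i < paras.size(); i++) {
            String label = "Para " + (i + 1) + ":";
            // Pad continuation lines so they line up under the first value.
            String indent = String.format("%" + label.length() + "s", "");
            System.out.println(label + String.join("\n" + indent, paras.get(i)));
        }
    }
}
```

Because the pattern uses a backreference to the tag name, it is not tied to LOCATION and would pick up any well-formed tag pair. Input with Windows line endings (`\r\n`) is not handled by this split and would need a different separator pattern.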