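Extracting data with a regular expression

Date: 2015-08-14 17:11:23

Tags: java regex

I found a good solution here, but the regex splits the string into an empty "" string in addition to the two pieces I need.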

String  Result = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

String[] Arr = Result.split("<[^>]*>");
for (String elem : Arr) {
    // println avoids treating the text as a printf format string
    System.out.println(elem);
}

The result is:

Arr[0]= ""
Arr[1]= Securities regulation in the United States
Arr[2]= Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.

Arr[1] and Arr[2] are split correctly; I just can't get rid of Arr[0].

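2 answers:

Answer 0 (score: 2)

You can use the opposite approach and capture the text you want instead of splitting on what you don't want, with a regex like this: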

(?s)(?:^|>)(.*?)(?:<|$)

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

Pattern pattern = Pattern.compile("(?s)(?:^|>)(.*?)(?:<|$)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    // When the input starts with a tag, the first capture is empty; skip it
    if (!matcher.group(1).isEmpty()) {
        System.out.println("group 1: " + matcher.group(1));
    }
}
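
The pattern in this answer can be read piece by piece. The sketch below writes it with `Pattern.COMMENTS` so each part carries a note, and wraps the loop in a helper. The class name `TextBetweenTags` and the method `extractText` are hypothetical names chosen for illustration, and skipping empty captures is an assumption about what the asker wants (when the input begins with a tag, the first capture is empty).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextBetweenTags {
    // Same pattern as in the answer; DOTALL replaces the inline (?s) flag,
    // COMMENTS lets whitespace and #-comments appear inside the pattern.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(
            "(?:^|>)   # start of input, or the '>' that closes a tag\n"
          + "(.*?)     # group 1: as little text as possible\n"
          + "(?:<|$)   # the '<' that opens the next tag, or end of input\n",
            Pattern.DOTALL | Pattern.COMMENTS);

    // Hypothetical helper: returns every non-empty capture, in order.
    static List<String> extractText(String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = BETWEEN_TAGS.matcher(input);
        while (matcher.find()) {
            String text = matcher.group(1);
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "<a>first</a> - second";
        System.out.println(extractText(line)); // [first,  - second]
    }
}
```

Note that the second piece keeps its leading " - "; call `trim()` or strip the separator afterwards if that is not wanted.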

Answer 1 (score: 1)

If you use only split, you can't avoid that empty string, especially since the regex match is not zero-length.
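
To illustrate the behavior described above, here is a minimal sketch, assuming Java 8 or later, where `String.split` keeps an empty leading element when a non-empty match sits at the very start of the input and drops trailing empty elements. The class name `SplitLeadingEmpty` and the short sample strings are made up for the demonstration.

```java
import java.util.Arrays;

public class SplitLeadingEmpty {
    // Splits on anything that looks like a tag, as in the question.
    static String[] splitOnTags(String input) {
        return input.split("<[^>]*>");
    }

    public static void main(String[] args) {
        // Tag at the very start: the empty text before it becomes element 0.
        System.out.println(Arrays.toString(splitOnTags("<a>first</a>second")));
        // prints [, first, second]

        // Tag at the very end: trailing empty strings are dropped by split(regex).
        System.out.println(Arrays.toString(splitOnTags("first<a>second</a>")));
        // prints [first, second]
    }
}
```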

You could try removing the first match if it sits at the start of the input, and then split the rest, e.g.

String[] Arr = Result.replaceFirst("^<[^>]+>", "").split("<[^>]+>");
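
As a runnable sketch of this approach applied to the string from the question (the class name `StripLeadingTagThenSplit` and the helper `splitOnTags` are hypothetical, introduced only for illustration):

```java
public class StripLeadingTagThenSplit {
    static String[] splitOnTags(String input) {
        String tag = "<[^>]+>";
        // Remove a tag only if it sits at the very start, then split on the remaining tags.
        return input.replaceFirst("^" + tag, "").split(tag);
    }

    public static void main(String[] args) {
        String result = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>"
                + "Securities regulation in the United States</a>"
                + " - Securities regulation in the United States is the field of U.S. law"
                + " that covers transactions and other dealings with securities.";
        // Brackets make any leading or trailing whitespace in each piece visible.
        for (String part : splitOnTags(result)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

This only removes one tag at the start. If the input begins with two tags in a row, or has adjacent tags in the middle, split will still produce empty strings, so filtering out empty elements afterwards may be needed.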

But in general you should avoid using regex with HTML/XML. Try using a parser instead, such as Jsoup.