Question

Hello, I'm new to the world of regex. I want to extract the timestamp, the location, and the "id_str" field from the test string below, in Java.

20110302140010915|{"user":{"is_translator":false,"show_all_inline_media":false,"following":null,"geo_enabled":true,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1298918947\/images\/themes\/theme1\/bg.png","listed_count":0,"favourites_count":2,"verified":false,"time_zone":"Mountain Time (US & Canada)","profile_text_color":"333333","contributors_enabled":false,"statuses_count":152,"profile_sidebar_fill_color":"DDEEF6","id_str":"207356721","profile_background_tile":false,"friends_count":14,"followers_count":13,"created_at":"Mon Oct 25 04:05:43 +0000 2010","description":null,"profile_link_color":"0084B4","location":"WaKeeney, KS","profile_sidebar_border_color":"C0DEED",

I tried this:

(\d*).*?"id_str":"(\d*)",.*"location":"([^"]*)"

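With the lazy quantifier .*? it does a lot of backtracking (3000 steps in RegexBuddy), but the number of characters between the anchors "id_str" and "location" is not always the same. Also, it could be catastrophic if location is not found in the string.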

How can I:

1) avoid unnecessary backtracking?

and

2) fail faster on strings that do not match?

Thanks.

Answer 1

This looks like JSON, and trust me, it is much easier to parse it as JSON:

import org.json.JSONObject; // assumes the org.json library is on the classpath

// String.split takes a regex, so the literal pipe must be escaped
String[] input = inputStr.split("\\|", 2);
System.out.println("Timestamp: " + input[0]); // 20110302140010915

JSONObject user = new JSONObject(input[1]).getJSONObject("user");

System.out.println("ID: " + user.getString("id_str")); // 207356721
System.out.println("Location: " + user.getString("location")); // WaKeeney, KS

Reference:
JSON Java API docs
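
One detail in the snippet above is easy to get wrong: String.split interprets its argument as a regular expression, so a literal pipe has to be escaped or quoted. The following is a minimal sketch of just that splitting step, using only the standard library; the class name SplitTimestamp and the method splitRecord are made up for illustration, and the sample line is a shortened version of the one in the question.

```java
import java.util.regex.Pattern;

class SplitTimestamp {
    // Splits "timestamp|payload" at the first literal '|'.
    // Pattern.quote turns the pipe into a literal, since split() expects a regex.
    static String[] splitRecord(String line) {
        return line.split(Pattern.quote("|"), 2);
    }

    public static void main(String[] args) {
        String line = "20110302140010915|{\"user\":{\"id_str\":\"207356721\",\"location\":\"WaKeeney, KS\"}}";
        String[] parts = splitRecord(line);
        System.out.println("Timestamp: " + parts[0]);
        System.out.println("Payload:   " + parts[1]);
    }
}
```

The limit of 2 keeps any later pipe characters inside the JSON payload intact.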

Answer 2

You can try this:

(\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"

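The idea here is to eliminate backtracking as much as possible by using only possessive quantifiers and atomic groups with restricted character classes (as you did in your last capture group).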

For example, to avoid the first lazy quantifier, I use:

(?>[^"]++|"(?!id_str":))+

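The regex engine consumes as many non-double-quote characters as possible (without registering a single backtracking position, since a possessive quantifier is used). When a double quote is found, the lookahead checks that it is not followed by the anchor id_str":. This whole part is wrapped in an atomic group (no backtracking inside it) that is repeated one or more times.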
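
As a rough illustration of that sub-pattern in Java, the sketch below skips ahead to the "id_str" key and captures its digits. The class name SkipToAnchor, the method findId, and the shortened input are assumptions made for the example; only the regex fragment comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipToAnchor {
    // (?>[^"]++|"(?!id_str":))+  consumes runs of non-quote characters possessively,
    // and a single quote only when it does not start the key id_str":
    private static final Pattern ID = Pattern.compile(
        "(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\"");

    // Returns the digits of the first id_str value, or null if the key is absent.
    static String findId(String s) {
        Matcher m = ID.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "{\"user\":{\"verified\":false,\"id_str\":\"207356721\",\"friends_count\":14}}";
        System.out.println(findId(s));
    }
}
```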

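Don't be afraid of the lookahead: it fails fast and is only tested when a double quote is found. However, you can try the same with i if you are sure it is less frequent than " (or with a rarer preceding character, if you find one):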

(?>[^i]++|i(?!d_str":))+id_str":(...

EDIT: the best choice here seems to be the comma, which is less frequent (200 steps vs. 422 with the double quote):

(\d*+)(?>[^,]++|,(?!"id_str":))+,"id_str":"(\d*+)",(?>[^,]++|,(?!"location":))+,"location":"([^"]*+)"
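
A minimal sketch of the comma-delimited pattern in Java follows. The class name CommaDelimited, the method extract, and the shortened sample line are assumptions for illustration. Note that this variant only matches when each key is directly preceded by a comma, so it would miss a key that is the first one in its object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaDelimited {
    // Same structure as the quote-based pattern, but the skipping loop
    // stops at commas, which occur less often than double quotes here.
    private static final Pattern P = Pattern.compile(
        "(\\d*+)(?>[^,]++|,(?!\"id_str\":))+,\"id_str\":\"(\\d*+)\","
        + "(?>[^,]++|,(?!\"location\":))+,\"location\":\"([^\"]*+)\"");

    // Returns {timestamp, id_str, location}, or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "20110302140010915|{\"user\":{\"statuses_count\":152,"
            + "\"id_str\":\"207356721\",\"friends_count\":14,\"description\":null,"
            + "\"location\":\"WaKeeney, KS\",\"profile_sidebar_border_color\":\"C0DEED\",";
        String[] fields = extract(line);
        System.out.println(String.join(" / ", fields));
    }
}
```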

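For better performance, if possible, add an anchor (^) at the start of the pattern when the match begins at the start of the string, or at the start of a line (in multiline mode):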

^(\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"
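
The anchored pattern could be used from Java roughly as sketched below, processing one record per call. The class name AnchoredExtract, the method extract, and the shortened sample lines are assumptions for the example. With ^ and without multiline mode, the engine only attempts a match at the start of the input, so a record lacking "location" is rejected after a single attempt rather than one attempt per starting position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredExtract {
    // The anchored pattern from the answer, written as a Java string literal.
    private static final Pattern P = Pattern.compile(
        "^(\\d*+)(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\","
        + "(?>[^\"]++|\"(?!location\":))+\"location\":\"([^\"]*+)\"");

    // Returns {timestamp, id_str, location}, or null when the record does not match.
    static String[] extract(String record) {
        Matcher m = P.matcher(record);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String ok = "20110302140010915|{\"user\":{\"statuses_count\":152,"
            + "\"id_str\":\"207356721\",\"friends_count\":14,"
            + "\"location\":\"WaKeeney, KS\",\"profile_sidebar_border_color\":\"C0DEED\",";
        String noLocation = "20110302140010915|{\"user\":{\"id_str\":\"207356721\",\"friends_count\":14}}";

        System.out.println(String.join(" / ", extract(ok)));
        System.out.println(extract(noLocation) == null ? "no match" : "match");
    }
}
```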

Regex: avoiding unnecessary backtracking in Java