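I want to capture dates whose month is written out alphabetically. A date can take one of several forms,
and it will also appear inside sentences. For example:
"We can meet sometime in the afternoon."
I am using the following regular expressions in Java: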
((?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)((\\s+)?(?<date>\\d+)?(st|nd|rd|th))?(\\s+)?,?(\\s+)?(?<year>(20)\\d\\d)?)
((?<date>\\d+)?(st|nd|rd|th)?\\s+(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s+)?,?(?<year>(19|20)\\d\\d)?)
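After matching with the regex, I need to report the exact position of the token within the string.
When I look at the index returned by Matcher.end(), it seems my expression also captures the space after Jan. I do want to capture expressions like "Jan 1st", but the whitespace should only be consumed when the capture group that follows it actually matches.
Is it possible to modify the regexes above to do this?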
Answer 0 (score: 2)
Expanding the pattern for readability:
(
(?<month>
jan(uary)?
| feb(ruary)?
| mar(ch)?
| apr(il)?
| may
| jun(e)?
| jul(y)?
| aug(ust)?
| sep(t?|tember)?
| oct(ober)?
| nov(ember)?
| dec(ember)?
)
(
(\\s+)?
(?<date>\\d+)?
(st|nd|rd|th)
)?
(\\s+)?
,?
(\\s+)?
(?<year>(20)\\d\\d)?
)
The whitespace before the year can match even when the year is absent. Likewise, the date suffix can match even when the date is absent.
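Both effects can be reproduced with a short test. The sketch below is only an illustration: the class name, the helper, and the sample sentences are made up for this purpose, and the CASE_INSENSITIVE flag is an assumption, since the question does not say how the pattern is compiled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingSpaceDemo {
    // The first pattern from the question, copied unchanged.
    static final String ORIGINAL =
        "((?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
        + "|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)"
        + "((\\s+)?(?<date>\\d+)?(st|nd|rd|th))?(\\s+)?,?(\\s+)?(?<year>(20)\\d\\d)?)";

    // Assumption: matching is case-insensitive.
    static final Pattern PATTERN = Pattern.compile(ORIGINAL, Pattern.CASE_INSENSITIVE);

    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The optional (\\s+)? after the month consumes the space although no year follows.
        System.out.println("[" + firstMatch("We can meet in Jan afternoon.") + "]");
        // The suffix alternative matches "th" although there is no date in front of it.
        System.out.println("[" + firstMatch("Jan the team") + "]");
    }
}
```

The brackets in the output make the trailing space visible in the first case.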
Cleaning up and fixing the pattern, I get:
\\b
(?<month>
jan(uary)?
| feb(ruary)?
| mar(ch)?
| apr(il)?
| may
| jun(e)?
| jul(y)?
| aug(ust)?
| sep(t|tember)?
| oct(ober)?
| nov(ember)?
| dec(ember)?
)
(
\\s*
(?<date>\\d+)
(st|nd|rd|th)?
)?
(
\\s*
,?
\\s*
(?<year>(19|20)\\d\\d)
)?
\\b
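I removed the outer group, since you get it as group 0 anyway. The t? in sep(t?|tember)? was changed to t. Every (\\s+)? was changed to the equivalent \\s*. I moved the ? from (?<date>\\d+)? to (st|nd|rd|th). I wrapped the year in a group and moved the ? from (?<year>20\\d\\d) onto that group. I added word boundaries (\\b) so the match cannot start or end in the middle of a word.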
As a single line:
\\b(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s*(?<date>\\d+)(st|nd|rd|th)?)?(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b
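To check that the cleaned-up pattern reports the position the question asks for, it can be run through Matcher.start() and Matcher.end(). This is a minimal sketch: the class, the helper, and the sample sentences are hypothetical, and CASE_INSENSITIVE is again an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedPatternDemo {
    // The single-line pattern above, as a Java string literal.
    static final Pattern FIXED = Pattern.compile(
        "\\b(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
        + "|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)"
        + "(\\s*(?<date>\\d+)(st|nd|rd|th)?)?"
        + "(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b",
        Pattern.CASE_INSENSITIVE); // assumption: case-insensitive matching

    // Returns {start, end} of the first match, or null if nothing matches.
    static int[] span(String input) {
        Matcher m = FIXED.matcher(input);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String a = "We can meet in Jan afternoon.";
        int[] s = span(a);
        // Only "Jan" is covered; the following space is no longer part of the match.
        System.out.println("[" + a.substring(s[0], s[1]) + "]");

        String b = "See you on Jan 1st, 2013 maybe";
        int[] t = span(b);
        // Whitespace is consumed here because a date and a year follow it.
        System.out.println("[" + b.substring(t[0], t[1]) + "]");
    }
}
```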
Combining it with your second pattern:
\\b
(
(?<month1>
jan(uary)?
| feb(ruary)?
| mar(ch)?
| apr(il)?
| may
| jun(e)?
| jul(y)?
| aug(ust)?
| sep(t|tember)?
| oct(ober)?
| nov(ember)?
| dec(ember)?
)
(
\\s*
(?<date1>\\d+)
(st|nd|rd|th)?
)?
|
(?<date2>\\d+)
(st|nd|rd|th)?
\\s*
(?<month2>
jan(uary)?
| feb(ruary)?
| mar(ch)?
| apr(il)?
| may
| jun(e)?
| jul(y)?
| aug(ust)?
| sep(t|tember)?
| oct(ober)?
| nov(ember)?
| dec(ember)?
)
)
(
\\s*
,?
\\s*
(?<year>(19|20)\\d\\d)
)?
\\b
As a single line:
\\b((?<month1>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s*(?<date1>\\d+)(st|nd|rd|th)?)?|(?<date2>\\d+)(st|nd|rd|th)?\\s*(?<month2>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?))(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b
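Because Java does not allow the same group name twice in one pattern, the combined version uses month1/date1 and month2/date2, and the caller has to pick whichever pair took part in the match. One possible way to do that is sketched below; the class and helper names are hypothetical, and CASE_INSENSITIVE is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedPatternDemo {
    static final String MONTHS =
        "jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
        + "|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?";

    // Same structure as the combined single-line pattern above.
    static final Pattern COMBINED = Pattern.compile(
        "\\b((?<month1>" + MONTHS + ")(\\s*(?<date1>\\d+)(st|nd|rd|th)?)?"
        + "|(?<date2>\\d+)(st|nd|rd|th)?\\s*(?<month2>" + MONTHS + "))"
        + "(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b",
        Pattern.CASE_INSENSITIVE); // assumption: case-insensitive matching

    // A named group that did not take part in the match returns null.
    static String firstNonNull(String a, String b) {
        return a != null ? a : b;
    }

    // Returns {month, date, year} for the first match (entries may be null),
    // or null if nothing matches.
    static String[] parts(String input) {
        Matcher m = COMBINED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            firstNonNull(m.group("month1"), m.group("month2")),
            firstNonNull(m.group("date1"), m.group("date2")),
            m.group("year")
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", parts("1st Jan 2013")));
        System.out.println(String.join("/", parts("Jan 1st, 2013")));
    }
}
```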
Answer 1 (score: 1)
Another version:
static private String month = "(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)";
static private String suffix = "(?:st|nd|rd|th)";
static private String date = "(?<date>\\d{1,2})";
static private String year = "(?<year>\\d{4})";
// A month name (optionally followed by space followed by a date (optionally
// followed by a suffix or space and a comma) (optionally followed by space
// followed by a year))
static private String order1 = String.format(
"%s(?:\\s+%s(?:%s|\\s+,)?(?:\\s+%s)?)?", month, date, suffix,
year);
// A date followed by a suffix followed by a month (optionally followed by
// space and a comma) optionally followed by space and a year
static private String order2 = String.format(
"%s%s\\s+%s(?:\\s+,)?(?:\\s+%s)?", date, suffix, month, year);
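Yes, there isn't much reason to use String.format here, but since the fields are static it shouldn't hurt performance much, and it makes the regexes easier to read than any other way I can think of in Java.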
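It matches all of the example patterns (and gets the right output, IIRC), including the versions embedded in sentences. The only problem you might run into is that it will eat the comma when the comma is preceded by whitespace, as in "Let's meet on Jan 1 , okay?", but it will not match the comma if the text is written "Let's meet on Jan 1, okay?". (When I say "match the comma", I mean that the regex as a whole will take the comma; the named captures will still be correct.)

I did change the year to simply match four digits, and I changed the date to match only one or two digits. Like @MarkusJarderot, I changed "sep" so that the "t" is no longer separately optional, because the whole suffix is already optional. I have tried to write both regexes so that logical blocks can be added and removed: compare them with the version below and note how I changed it without rewriting the whole expression.

One thing to be aware of: in some cases both regexes match (order1 matches only the bare month, while order2 matches a date of the form "1st Jan"). In that situation you will need to decide which expression to go with.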
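As written, these regexes avoid matching any date in a form you did not list. I would suggest modifying them to allow the following forms (# marks an item from the original list):
Jan //(already supported by the original example)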
Jan 1
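This version of the code supports the forms above. It is also cleaner: the month has been converted to use only non-capturing groups (so no extra captures are created for no reason), and following @MarkusJarderot's answer I removed the capture around the whole regex. The wider set of date formats also allows for less contorted regexes. One small problem these forms introduce is that v1 will now match a date of the form "1 Jan 2013" as "Jan 20", whereas v2 matches it correctly. This is the "thing to be aware of" I mentioned above; you will probably want to work out how to decide which regex to use (try both and use the one that matches more parts of the date).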
static private String month = "(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
static private String suffix = "(?:st|nd|rd|th)";
static private String date = "(?<date>\\d{1,2})";
static private String year = "(?<year>\\d{4})";
// A month name (optionally followed by space followed by a date (optionally
// followed by a suffix)(optionally followed by a comma, possibly with space
// before it)(optionally followed by space followed
// by a year))
static private String v1 = String.format(
"%s(?:\\s+%s%s?(?:\\s*,)?(?:\\s+%s)?)?", month, date, suffix, year);
// A date (optionally followed by a suffix) followed by space followed by a
// month (optionally followed by
// a comma, possibly with space before it) optionally followed by space and
// a year
static private String v2 = String.format(
"%s%s?\\s+%s(?:\\s*,)?(?:\\s+%s)?", date, suffix, month, year);
Or, as regexes without the Java (the output of format):
(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)(?:\s+(?<date>\d{1,2})(?:st|nd|rd|th)?(?:\s*,)?(?:\s+(?<year>\d{4}))?)?
(?<date>\d{1,2})(?:st|nd|rd|th)?\s+(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)(?:\s*,)?(?:\s+(?<year>\d{4}))?
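The "try both and keep the one that matches more date parts" idea could look roughly like the sketch below. It is only one possible reading of that suggestion: the class, the helper methods, and the tie-breaking rule (prefer v1 on a tie) are assumptions, as is CASE_INSENSITIVE, and it looks only at the first match of each pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BestMatchDemo {
    static final String MONTH = "(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?"
        + "|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?"
        + "|nov(?:ember)?|dec(?:ember)?)";
    static final String SUFFIX = "(?:st|nd|rd|th)";
    static final String DATE = "(?<date>\\d{1,2})";
    static final String YEAR = "(?<year>\\d{4})";

    // Built the same way as v1 and v2 above; the flag is an assumption.
    static final Pattern V1 = Pattern.compile(String.format(
        "%s(?:\\s+%s%s?(?:\\s*,)?(?:\\s+%s)?)?", MONTH, DATE, SUFFIX, YEAR),
        Pattern.CASE_INSENSITIVE);
    static final Pattern V2 = Pattern.compile(String.format(
        "%s%s?\\s+%s(?:\\s*,)?(?:\\s+%s)?", DATE, SUFFIX, MONTH, YEAR),
        Pattern.CASE_INSENSITIVE);

    // Returns {month, date, year} for the first match, or null if nothing matches.
    static String[] parts(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("month"), m.group("date"), m.group("year") };
    }

    // Counts how many of month/date/year were actually captured.
    static int score(String[] parts) {
        int n = 0;
        if (parts != null) {
            for (String s : parts) {
                if (s != null) {
                    n++;
                }
            }
        }
        return n;
    }

    // Keeps the result with more captured parts; v1 wins a tie.
    static String[] best(String input) {
        String[] a = parts(V1, input);
        String[] b = parts(V2, input);
        return score(b) > score(a) ? b : a;
    }

    public static void main(String[] args) {
        // v1 alone would read this as "Jan 20"; v2 captures all three parts.
        System.out.println(String.join("/", best("1 Jan 2013")));
        System.out.println(String.join("/", best("Jan 1st, 2013")));
    }
}
```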