Question

I'm trying to come up with a regular expression to separate author and book title information in a data set.

This one seems to work fine:

^\s*(?:(.*)\s+-\s+)?'?([^']+'?.*)\s*$

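On the data below, it identifies the author in group 1 as the text preceding the first hyphen. When there is no hyphen, it identifies only a book title, in group 2: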

William Faulkner - 'Light In August'
William Faulkner - 'Sanctuary'
William Faulkner - 'The Sound and the Fury'
Saki - 'Esme'
Saki - 'The Unrest Cure' (Second Edition)
Saki (File Under: Hector Hugh Munro) - 'The Interlopers' (Anniversary Multi-pack)
William Faulkner - 'The Sound and the Fury' (Collector's Re-issue)
'The Sound and the Fury'
The Sound and the Fury
The Bible (St James Version)

However, it fails on the following string, which contains an ampersand:

'Jim Clarke & Oscar Wilde'

Can someone explain why it doesn't work here?

Update:

Here is the relevant Java code:

Pattern pattern = Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");
Matcher matcher = pattern.matcher(text);
if(!matcher.matches()) 
{
    logFailure(text);
}
else
{
    String author = matcher.group(1).trim();
    String bookTitle = matcher.group(2).trim();
}

A NullPointerException is thrown from the following line of the excerpt above:

    String author = matcher.group(1).trim();

Answer 1

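If there is no hyphen, matcher.group(1) returns null, so .trim() throws the NPE. The ampersand has nothing to do with it; the failing string simply has no "author - " prefix.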
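To see this concretely, here is a minimal sketch that runs the question's pattern against inputs with and without a hyphen. The class name OptionalGroupDemo and the helper rawAuthor are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern PATTERN =
            Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");

    // Hypothetical helper: returns group 1 (the author) as-is, which is
    // null when the optional "author - " prefix took no part in the match.
    static String rawAuthor(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + text);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(rawAuthor("Saki - 'Esme'"));              // Saki
        System.out.println(rawAuthor("'Jim Clarke & Oscar Wilde'")); // null
        System.out.println(rawAuthor("'The Sound and the Fury'"));   // null
    }
}
```

The string with the ampersand and the plain quoted title behave identically: both match, and both leave group 1 null.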

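Your current regex also eats the first single quote it finds. Also, do you really want to reject non-matches? All you do there is log. If text doesn't actually have to match the pattern, you can use a much simpler algorithm.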

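// Find the first hyphen; indexOf returns -1 when there is none.
int hyphenIndex = text.indexOf("-");
if (hyphenIndex > -1) {
    // Everything before the hyphen is the author.
    String author = text.substring(0, hyphenIndex);
    System.out.println(author);
}
// With no hyphen, hyphenIndex + 1 is 0, so the whole text is the title.
String title = text.substring(hyphenIndex + 1);
System.out.println(title);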
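One caveat with a bare "-": an entry with no author but a hyphenated word, such as a title ending in "(Collector's Re-issue)", would be split inside that word. A possible variant, sketched below, splits on " - " (space, hyphen, space) instead. The class HyphenSplitSketch and its split method are hypothetical names, and it assumes the separator is always surrounded by single spaces, which is narrower than the regex's \s+-\s+. It also splits at the first separator and keeps the quotes around the title.

```java
class HyphenSplitSketch {
    // Hypothetical helper: returns {author, title}; author is "" when
    // the text contains no " - " separator.
    static String[] split(String text) {
        String trimmed = text.trim();
        int sep = trimmed.indexOf(" - ");
        if (sep < 0) {
            // No separator: the whole text is the title.
            return new String[] { "", trimmed };
        }
        String author = trimmed.substring(0, sep).trim();
        // Skip the three characters of " - " to reach the title.
        String title = trimmed.substring(sep + 3).trim();
        return new String[] { author, title };
    }
}
```

For example, split("Saki - 'Esme'") gives the author "Saki" and the title "'Esme'", while split("The Bible (St James Version)") gives an empty author and the full text as the title.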

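However, if you really do need to reject certain strings, there are a few things you can do to make the regex more readable.

Change the regex to "^(?:(.*)\\s+-\\s+)?'?([^']+'?.*)$" and call pattern.matcher(text.trim()).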
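Put together with a null check on group 1, that could look roughly like the sketch below. TrimmedPatternSketch and parse are illustrative names, and returning null for rejected input is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrimmedPatternSketch {
    // Simplified pattern: the leading and trailing \s* are gone because
    // the input is trimmed before matching.
    static final Pattern PATTERN =
            Pattern.compile("^(?:(.*)\\s+-\\s+)?'?([^']+'?.*)$");

    // Hypothetical helper: returns {author, title}, or null when the text
    // does not match. author is "" when there is no "author - " prefix.
    static String[] parse(String text) {
        Matcher m = PATTERN.matcher(text.trim());
        if (!m.matches()) {
            return null;
        }
        // group(1) is null when the optional prefix did not participate.
        String author = m.group(1) == null ? "" : m.group(1).trim();
        String title = m.group(2).trim();
        return new String[] { author, title };
    }
}
```

Note that the title still carries its closing quote (parse("Saki - 'Esme'") yields the title "Esme'"), because the pattern consumes only the opening one.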

Answer 2

group(1) can return null; you should check for that before trimming.

Answer 3

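Your regex works fine. It's just that the example you gave has no author, so the first capturing group is null. When you then call matcher.group(1).trim(), you get an NPE.

Just handle the null before calling trim. Perhaps something like this: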

String author = matcher.group(1);
if(author == null) {
  author = "";
}
author = author.trim();

Why does this regex fail on a single use case - a text string containing an ampersand?
