Regex to retrieve quoted strings and their quote character

Date: 2015-12-22 23:17:56

Tags: java regex

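I have a language that defines a string as being delimited by either single or double quotes, where the delimiter is escaped inside the string by doubling it. For example, all of the following are legal strings: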

'This isn''t easy to parse.'
'Then John said, "Hello Tim!"'
"This isn't easy to parse."
"Then John said, ""Hello Tim!"""

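I have a collection of such strings (as defined above), separated by text that contains no quotes. What I am trying to do with a regex is parse out each string in the list. For example, here is an input: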

"Some String #1" OR 'Some String #2' AND "Some 'String' #3" XOR
  'Some "String" #4' HOWDY "Some ""String"" #5" FOO 'Some ''String'' #6'

The regex to determine whether a string has this form is trivial:

^(?:"(?:[^"]|"")*"|'(?:[^']|'')*')(?:\s+[^"'\s]+\s+(?:"(?:[^"]|"")*"|'(?:[^']|'')*'))*
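A minimal sketch of running this check in Java, assuming the whole input must have the form described. The class name `QuotedListValidator` and the method `isValid` are hypothetical; the pattern is the one above, written with the repeated group closed and split into a reusable fragment.

```java
import java.util.regex.Pattern;

class QuotedListValidator {
    // One quoted string; the delimiter is escaped inside it by doubling.
    private static final String QUOTED =
            "(?:\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*')";

    // A quoted string, then zero or more "separator + quoted string" pairs.
    private static final Pattern LIST = Pattern.compile(
            QUOTED + "(?:\\s+[^\"'\\s]+\\s+" + QUOTED + ")*");

    static boolean isValid(String input) {
        // matches() requires the entire input to match, so no anchors are needed.
        return LIST.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\"Some String #1\" OR 'Some String #2'"));
        System.out.println(isValid("\"Some String #1\" OR"));
    }
}
```

Using `matches()` rather than `find()` means a trailing separator or a stray quote makes the check fail instead of being silently ignored.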

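After running the expression above to test whether the input has this form, I need another regex to pull each delimited string out of the input. My plan is as follows: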

Pattern pattern = Pattern.compile("What REGEX goes here?");
Matcher matcher = pattern.matcher(inputString);
int startIndex = 0;
while (matcher.find(startIndex))
{
    String quote        = matcher.group(1);
    String quotedString = matcher.group(2);
    ...
    startIndex = matcher.end();
}

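I would like a regex that captures the quote character in group #1 and the text inside the quotes in group #2 (I am using Java regex). So, for the input above, I am looking for a regex that produces the following output on each loop iteration: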

Loop 1: matcher.group(1) = "
        matcher.group(2) = Some String #1
Loop 2: matcher.group(1) = '
        matcher.group(2) = Some String #2
Loop 3: matcher.group(1) = "
        matcher.group(2) = Some 'String' #3
Loop 4: matcher.group(1) = '
        matcher.group(2) = Some "String" #4
Loop 5: matcher.group(1) = "
        matcher.group(2) = Some ""String"" #5
Loop 6: matcher.group(1) = '
        matcher.group(2) = Some ''String'' #6

The patterns I have tried so far (unescaped first, then escaped for Java code):

(["'])((?:[^\1]|\1\1)*)\1
"([\"'])((?:[^\\1]|\\1\\1)*)\\1"

(?<quot>")(?<val>(?:[^"]|"")*)"|(?<quot>')(?<val>(?:[^']|'')*)'
"(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'"

Both of these fail when the pattern is compiled.

Is such a regex possible?

5 Answers:

Answer 0 (score: 2)

Create a utility class that does the matching for you:

/tmp
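As an illustration only (this is not the answerer's original code), such a utility class might look like the sketch below. The class name `QuotedStringFinder`, the method `findAll`, and the pattern itself are assumptions. In Java, a back-reference such as `\1` is not allowed inside a character class, and a group name may not be defined twice in one pattern, which is consistent with both attempts in the question failing to compile. The sketch sidesteps the first problem with a negative lookahead, `(?!\1).`, meaning "any character that is not the opening quote". It assumes the input has already been validated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // Group 1: the quote character. Group 2: the raw text between the quotes.
    // (?!\1). consumes one character that is not the opening quote;
    // \1\1 consumes a doubled (escaped) quote.
    // DOTALL lets a quoted string span line breaks.
    private static final Pattern QUOTED = Pattern.compile(
            "([\"'])((?:(?!\\1).|\\1\\1)*)\\1", Pattern.DOTALL);

    /** Returns {quote, text} pairs in the order they appear in the input. */
    static List<String[]> findAll(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\"Some String #1\" OR 'Some String #2' AND \"Some 'String' #3\" XOR\n"
                + "  'Some \"String\" #4' HOWDY \"Some \"\"String\"\" #5\" FOO 'Some ''String'' #6'";
        for (String[] pair : findAll(input)) {
            System.out.println("quote=" + pair[0] + ", text=" + pair[1]);
        }
    }
}
```

Group 2 keeps doubled quotes as they appear in the input, matching the output requested in the question; unescaping them would be a separate `replace` step.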

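Answer 1 (score: 0)

I'm not sure whether this is what you are asking for, but instead of using a regular expression you could write some code that parses the string and gives you the desired result (the quote character and the inner text):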

class Parser {

  public static ParseResult parse(String str)
  throws ParseException {

    if(str == null || (str.length() < 2)){
      throw new ParseException();
    }

    Character delimiter = getDelimiter(str);

    // Remove delimiters
    str = str.substring(1, str.length() -1);

    // Unescape escaped quotes in inner string
    String escapedDelim = "" + delimiter + delimiter;
    str = str.replace(escapedDelim, "" + delimiter);

    return new ParseResult(delimiter, str);
  }

  private static Character getDelimiter(String str)
  throws ParseException {
    Character firstChar = str.charAt(0);
    Character lastChar = str.charAt(str.length() -1);

    if(!firstChar.equals(lastChar)){
      throw new ParseException(String.format(
            "First char (%s) doesn't match last char (%s) for string %s",
           firstChar, lastChar, str
      ));
    }

    return firstChar;
  }

}
class ParseResult {

  public final Character delimiter;
  public final String contents;

  public ParseResult(Character delimiter, String contents){
    this.delimiter = delimiter;
    this.contents = contents;
  }

}
class ParseException extends Exception {

  public ParseException(){
    super();
  }

  public ParseException(String msg){
    super(msg);
  }

}

Answer 2 (score: 0)

Use this regex:

"^('|\")(.*)\\1$"

Some test code:

public static void main(String[] args) {
    String[] tests = {
            "'This isn''t easy to parse.'",
            "'Then John said, \"Hello Tim!\"'",
            "\"This isn't easy to parse.\"",
            "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^('|\")(.*)\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
      .forEach(m -> System.out.println("1=" + m.group(1) + ", 2=" + m.group(2)));
}

Output:

1=', 2=This isn''t easy to parse.
1=', 2=Then John said, "Hello Tim!"
1=", 2=This isn't easy to parse.
1=", 2=Then John said, ""Hello Tim!""

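If you are interested in how to capture quoted text within the text:

This regex matches all the variants and captures the quote in group 1 and the quoted text in group 6: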

^((')|("))(.*?("\3|")(.*)\5)?.*\1$

See the live demo.

Here is some test code:

public static void main(String[] args) {
    String[] tests = {
            "'This isn''t easy to parse.'",
            "'Then John said, \"Hello Tim!\"'",
            "\"This isn't easy to parse.\"",
            "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^((')|(\"))(.*?(\"\\3|\")(.*)\\5)?.*\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
      .forEach(m -> System.out.println("quote=" + m.group(1) + ", quoted=" + m.group(6)));
}

Output:

quote=', quoted=null
quote=', quoted=Hello Tim!
quote=", quoted=null
quote=", quoted=Hello Tim!

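Answer 3 (score: 0)

Using a regex for this kind of problem is very challenging. A simple parser that does not use regex is easier to implement, understand, and maintain.

In addition, such a simple parser can easily support backslash escapes, as well as converting backslash sequences into characters (for example, "\n" into a newline).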
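A minimal sketch of such a hand-written parser, assuming the doubled-quote escaping described in the question. The class name `QuotedStringScanner` and the method `scan` are hypothetical. Unlike the output requested in the question, this version returns the unescaped contents, and it does not implement backslash escapes.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedStringScanner {
    /** Returns the unescaped contents of every quoted string in the input. */
    static List<String> scan(String input) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '"' && c != '\'') {
                i++;                      // separator text: skip it
                continue;
            }
            char quote = c;
            StringBuilder text = new StringBuilder();
            boolean closed = false;
            i++;                          // step past the opening quote
            while (i < input.length()) {
                char d = input.charAt(i);
                if (d != quote) {
                    text.append(d);
                    i++;
                } else if (i + 1 < input.length() && input.charAt(i + 1) == quote) {
                    text.append(quote);   // doubled quote becomes one literal quote
                    i += 2;
                } else {
                    closed = true;        // a lone quote closes the string
                    i++;
                    break;
                }
            }
            if (!closed) {
                throw new IllegalArgumentException(
                        "Unterminated string starting with " + quote);
            }
            result.add(text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(scan("'This isn''t easy' AND \"Then John said, \"\"Hello Tim!\"\"\""));
    }
}
```

Supporting backslash escapes would mean adding one more branch to the inner loop, which is the kind of change that tends to be harder to make in a single regex.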

Answer 4 (score: 0)

This can be done easily with a simple regex, as in the code below (shown here for double-quoted text):

private static Object[] checkPattern(String name, String regex) {
    List<String> matchedString = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(name);
    while (matcher.find()) {
        if (matcher.group().length() > 0) {
            matchedString.add(matcher.group());
        }
    }
    return matchedString.toArray();
}


@Test
public void quotedtextMultipleQuotedLines() {
    String text = "He said, \"I am Tom\". She said, \"I am Lisa\".";
    String quoteRegex = "(\"[^\"]+\")";
    String[] strArray = {"\"I am Tom\"", "\"I am Lisa\""};
    assertArrayEquals(strArray, checkPattern(text, quoteRegex));
}

Here we get the strings as array elements.
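The pattern in this answer covers only double quotes and has no notion of a doubled delimiter. As a hedged sketch, the same find-loop approach could be adapted to the question's syntax by alternating between the two quote styles. The class name `DoubledQuoteMatcher` and the method `findQuoted` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledQuoteMatcher {
    // Either a double-quoted or a single-quoted string; inside each one,
    // the delimiter may appear only in doubled form.
    private static final Pattern QUOTED = Pattern.compile(
            "\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'");

    static List<String> findQuoted(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());   // whole match, delimiters included
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findQuoted("He said, \"I am Tom\". She said, 'it''s Lisa'."));
    }
}
```

As in the answer's code, each result still includes its surrounding quotes, so the quote character is simply the first character of each match.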