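I have a language that defines a string as delimited by either single or double quotes, where the delimiter is escaped inside the string by doubling it. For example, all of the following are legal strings: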
'This isn''t easy to parse.'
'Then John said, "Hello Tim!"'
"This isn't easy to parse."
"Then John said, ""Hello Tim!"""
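I have a collection of strings (as defined above), separated by something that contains no quotes. What I am trying to do with a regex is parse out each string in the list. For example, here is an input:
"Some String #1" OR 'Some String #2' AND "Some 'String' #3" XOR
'Some "String" #4' HOWDY "Some ""String"" #5" FOO 'Some ''String'' #6'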
A regex that determines whether a string has this form is trivial:
^(?:"(?:[^"]|"")*"|'(?:[^']|'')*')(?:\s+[^"'\s]+\s+(?:"(?:[^"]|"")*"|'(?:[^']|'')*'))*
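After running the expression above to check that the input has this form, I need another regex to pull each delimited string out of the input. I plan to do it as follows: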
Pattern pattern = Pattern.compile("What REGEX goes here?");
Matcher matcher = pattern.matcher(inputString);
int startIndex = 0;
while (matcher.find(startIndex))
{
    String quote = matcher.group(1);
    String quotedString = matcher.group(2);
    ...
    startIndex = matcher.end();
}
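I would like a regex that captures the quote character in group #1 and the text inside the quotes in group #2 (I am using Java regex). So, for the input above, I am looking for a regex that produces the following output on each loop iteration: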
Loop 1: matcher.group(1) = "
        matcher.group(2) = Some String #1
Loop 2: matcher.group(1) = '
        matcher.group(2) = Some String #2
Loop 3: matcher.group(1) = "
        matcher.group(2) = Some 'String' #3
Loop 4: matcher.group(1) = '
        matcher.group(2) = Some "String" #4
Loop 5: matcher.group(1) = "
        matcher.group(2) = Some ""String"" #5
Loop 6: matcher.group(1) = '
        matcher.group(2) = Some ''String'' #6
The patterns I have tried so far (unescaped, then escaped for Java code):
(["'])((?:[^\1]|\1\1)*)\1
"([\"'])((?:[^\\1]|\\1\\1)*)\\1"
(?<quot>")(?<val>(?:[^"]|"")*)"|(?<quot>')(?<val>(?:[^']|'')*)'
"(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'"
Both of these fail when I try to compile the pattern.
Is such a regex possible?
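The compile failures described above can be reproduced with a small check. This is only a sketch: the class name CompileCheck and the helper compiles are hypothetical. The likely causes, as far as java.util.regex is concerned, are that the first attempt puts the backreference \1 inside a character class and the second declares the group names quot and val twice.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for illustration; not part of the original question.
class CompileCheck {
    /** Returns true if java.util.regex accepts the pattern, false if it rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Attempt 1: the backreference \1 sits inside a character class.
        String attempt1 = "([\"'])((?:[^\\1]|\\1\\1)*)\\1";
        // Attempt 2: the names quot and val each appear in both alternatives.
        String attempt2 =
            "(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'";

        System.out.println("attempt 1 compiles: " + compiles(attempt1));
        System.out.println("attempt 2 compiles: " + compiles(attempt2));
    }
}
```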
Answer 0 (score: 2)
Create a utility class that does the matching for you:
/tmp
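What such a utility class might look like is sketched below. It is not the code originally posted with this answer. The class name QuotedStringExtractor, the method extract, and the pattern itself are assumptions. The pattern uses a negative lookahead, (?!\1)., in place of a backreference inside a character class, so group 1 holds the quote character and group 2 holds the raw text with doubled quotes left as written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility class; names and pattern are illustrative assumptions.
class QuotedStringExtractor {
    // Group 1: opening quote. Group 2: any run of "a character that is not
    // the opening quote" or "the opening quote doubled". Then the closing quote.
    private static final Pattern QUOTED =
        Pattern.compile("([\"'])((?:(?!\\1).|\\1\\1)*)\\1", Pattern.DOTALL);

    /** Returns one {quote, rawContents} pair per quoted string found in the input. */
    static List<String[]> extract(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        // find() resumes after the previous match, so no start index is needed.
        while (matcher.find()) {
            result.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\"Some String #1\" OR 'Some String #2' AND \"Some 'String' #3\" XOR "
            + "'Some \"String\" #4' HOWDY \"Some \"\"String\"\" #5\" FOO 'Some ''String'' #6'";
        for (String[] pair : extract(input)) {
            System.out.println("group(1) = " + pair[0] + ", group(2) = " + pair[1]);
        }
    }
}
```

The sketch assumes the input has already passed the validation regex from the question; on malformed input the matcher may backtrack and report a shorter string than intended.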
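Answer 1 (score: 0)
I am not sure whether this is what you are asking for, but instead of using a regular expression you could write some code that parses the string and gives you the desired result (the quote character and the inner text):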
class Parser {
    // Parses one quoted string, e.g. 'isn''t', into its delimiter and unescaped contents.
    public static ParseResult parse(String str)
            throws ParseException {
        if (str == null || str.length() < 2) {
            throw new ParseException();
        }
        Character delimiter = getDelimiter(str);
        // Remove delimiters
        str = str.substring(1, str.length() - 1);
        // Unescape escaped quotes in inner string.
        // String.replace is literal, so the delimiter is never read as regex syntax.
        String escapedDelim = "" + delimiter + delimiter;
        str = str.replace(escapedDelim, "" + delimiter);
        return new ParseResult(delimiter, str);
    }

    // Returns the first character after checking that it equals the last one.
    private static Character getDelimiter(String str)
            throws ParseException {
        Character firstChar = str.charAt(0);
        Character lastChar = str.charAt(str.length() - 1);
        if (!firstChar.equals(lastChar)) {
            throw new ParseException(String.format(
                "First char (%s) doesn't match last char (%s) for string %s",
                firstChar, lastChar, str
            ));
        }
        return firstChar;
    }
}

// Immutable holder for the delimiter and the unescaped contents.
class ParseResult {
    public final Character delimiter;
    public final String contents;

    public ParseResult(Character delimiter, String contents) {
        this.delimiter = delimiter;
        this.contents = contents;
    }
}

class ParseException extends Exception {
    public ParseException() {
        super();
    }

    public ParseException(String msg) {
        super(msg);
    }
}
Answer 2 (score: 0)
Use this regex:
"^('|\")(.*)\\1$"
Some test code:
public static void main(String[] args) {
    String[] tests = {
        "'This isn''t easy to parse.'",
        "'Then John said, \"Hello Tim!\"'",
        "\"This isn't easy to parse.\"",
        "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^('|\")(.*)\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
        .forEach(m -> System.out.println("1=" + m.group(1) + ", 2=" + m.group(2)));
}
Output:
1=', 2=This isn''t easy to parse.
1=', 2=Then John said, "Hello Tim!"
1=", 2=This isn't easy to parse.
1=", 2=Then John said, ""Hello Tim!""
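If you are interested in how to capture quoted text inside the text:
This regex matches all the variants and captures the quote in group 1 and the quoted text in group 6: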
^((')|("))(.*?("\3|")(.*)\5)?.*\1$
See the live demo.
Here is some test code:
public static void main(String[] args) {
    String[] tests = {
        "'This isn''t easy to parse.'",
        "'Then John said, \"Hello Tim!\"'",
        "\"This isn't easy to parse.\"",
        "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^((')|(\"))(.*?(\"\\3|\")(.*)\\5)?.*\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
        .forEach(m -> System.out.println("quote=" + m.group(1) + ", quoted=" + m.group(6)));
}
Output:
quote=', quoted=null
quote=', quoted=Hello Tim!
quote=", quoted=null
quote=", quoted=Hello Tim!
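Answer 3 (score: 0)
Using a regex for this kind of problem is very challenging. A simple parser that does not use regex is easier to implement, understand, and maintain.
In addition, such a simple parser can easily support backslash escapes, as well as converting backslash sequences into characters (for example, "\n" into a newline).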
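A hand-written scanner along these lines could look like the sketch below. It is an assumption about what this answer has in mind, not code from the answer. The names QuoteScanner and scan are hypothetical. It handles only the doubled-quote escaping from the question and leaves doubled quotes as written, and it does not implement the backslash escapes mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical regex-free scanner; names are illustrative assumptions.
class QuoteScanner {
    /** Returns one {quote, rawContents} pair per quoted string in the input. */
    static List<String[]> scan(String input) {
        List<String[]> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char quote = input.charAt(i);
            if (quote != '"' && quote != '\'') {
                i++; // separator text between strings: skip it
                continue;
            }
            int start = i + 1;   // first character after the opening quote
            int from = start;    // where to look for the next candidate closing quote
            while (true) {
                int close = input.indexOf(quote, from);
                if (close < 0) {
                    throw new IllegalArgumentException(
                        "Unterminated string starting at index " + i);
                }
                boolean doubled = close + 1 < input.length()
                    && input.charAt(close + 1) == quote;
                if (doubled) {
                    from = close + 2; // escaped delimiter: keep scanning past both quotes
                } else {
                    out.add(new String[] {
                        String.valueOf(quote), input.substring(start, close) });
                    i = close + 1;    // continue after the closing quote
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] pair : scan("\"Some \"\"String\"\" #5\" FOO 'Some ''String'' #6'")) {
            System.out.println("quote = " + pair[0] + ", contents = " + pair[1]);
        }
    }
}
```

Because the scanner reports the index of an unterminated string, it can also replace the separate validation pass if strict checking of the separators is not required.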
Answer 4 (score: 0)
This can be done easily with the simple regex below:
private static Object[] checkPattern(String name, String regex) {
    List<String> matchedString = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(name);
    while (matcher.find()) {
        if (matcher.group().length() > 0) {
            matchedString.add(matcher.group());
        }
    }
    return matchedString.toArray();
}

@Test
public void quotedtextMultipleQuotedLines() {
    String text = "He said, \"I am Tom\". She said, \"I am Lisa\".";
    String quoteRegex = "(\"[^\"]+\")";
    String[] strArray = {"\"I am Tom\"", "\"I am Lisa\""};
    assertArrayEquals(strArray, checkPattern(text, quoteRegex));
}
Here we get the strings as array elements.