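Note
Another question, C# Regular Expressions with \Uxxxxxxxx characters in the pattern, has already been asked. This question differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes above 0 in a regular expression. It should be clear from my question that I already understand why these code points are represented as 2 characters: they are surrogate pairs (which is what the other question asked about). My question is how to convert them generically (since I have no control over the regular expressions being fed to the program) so they can be consumed by the .NET Regex engine.
Note that I now have a way to do this and would like to add my answer to my question, but since it is now marked as a duplicate I cannot add my answer.
I have some test data that is being passed to a Java library that I am porting to C#. I have isolated one particular problem case as an example. The character class in the original is in UTF-32 = \U0001BCA0-\U0001BCA3, which is not readily consumable by .NET: we get an "Unrecognized escape sequence \U" error.
I tried converting to UTF-16, and I have confirmed that the results for \U0001BCA0 and \U0001BCA3 are what is expected.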
UTF-32 | Codepoint | High Surrogate | Low Surrogate | UTF-16
---------------------------------------------------------------------------
0x0001BCA0 | 113824 | 55343 | 56480 | \uD82F\uDCA0
0x0001BCA3 | 113827 | 55343 | 56483 | \uD82F\uDCA3
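The values in the table can be reproduced with a short sketch. The class and method names (SurrogateTable, ToSurrogates) are hypothetical and exist only for illustration; the arithmetic is the standard UTF-16 surrogate split, cross-checked against char.ConvertFromUtf32.

```csharp
using System;

public static class SurrogateTable
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate values.
    public static (int High, int Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;     // 20-bit value
        int high = 0xD800 + (offset >> 10);   // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        foreach (int cp in new int[] { 0x1BCA0, 0x1BCA3 })
        {
            var (high, low) = ToSurrogates(cp);

            // Cross-check against the framework's own conversion.
            string s = char.ConvertFromUtf32(cp);
            bool agrees = s.Length == 2 && s[0] == high && s[1] == low;

            Console.WriteLine($"0x{cp:X8} | {cp} | {high} | {low} | \\u{high:X4}\\u{low:X4} | agrees: {agrees}");
        }
    }
}
```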
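However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get the exception "[x-y] range in reverse order".
Although it is pretty clear that the characters are specified in the correct order (it works in Java), I tried them in reverse and got the same error message.
I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but still got the exception "[x-y] range in reverse order".
So, how do I get the .NET Regex class to successfully parse this character range?
NOTE: I tried changing the code to generate a regular expression character class that lists every character instead of a range, and it seems to work, but that turns my regular expressions from a few dozen characters into several thousand characters, which surely won't do wonders for performance.
Again, the above is an isolated example of a failure within a larger string. What I am looking for is a general way to convert regexes like the following so they can be parsed by the .NET Regex class.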
"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"
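Answer 0 (score: 3)
You are assuming that Regex will recognize "\uD82F\uDCA0" as a single composite character. That is not the case, because strings in .NET are represented internally as sequences of 16-bit UTF-16 code units.
Unicode has the concept of code points, an abstraction that is independent of the physical representation. Depending on the encoding in use, not every code point fits in a single code unit. This is very apparent in UTF-8, where every code point above 127 needs two or more bytes. In .NET, a char is a UTF-16 code unit, which means that for planes above 0 you need to combine two chars (a surrogate pair). The Regex engine still treats each of them as an individual character.
Long story short: don't treat the character combination as a code point; treat it as individual characters. So in your case, the regex would be: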
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
        Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
    }
}
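Answer 1 (score: 1)
Strings in C# are UTF-16 encoded. That is why the character class in this regex is parsed as:
'\uD82F'
or the range \uDCA0-\uD82F
or '\uDCA3'
The range \uDCA0-\uD82F is obviously invalid, which causes the "[x-y] range in reverse order" exception.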
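This parsing behavior can be observed with a minimal sketch. The names ReverseRangeDemo and IsAccepted are hypothetical; the sketch assumes only that the Regex constructor reports an invalid pattern with an ArgumentException (or a type derived from it).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseRangeDemo
{
    // Returns true if the Regex constructor accepts the pattern,
    // false if it rejects it as invalid.
    public static bool IsAccepted(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // The C# compiler turns each \uXXXX escape into one UTF-16 code unit,
        // so the class body reads: D82F, then the range DCA0-D82F, then DCA3.
        string failing = "[\uD82F\uDCA0-\uD82F\uDCA3]";
        Console.WriteLine(IsAccepted(failing));

        // Keeping the shared high surrogate outside the class leaves a
        // well-ordered range of low surrogates.
        string working = "\uD82F[\uDCA0-\uDCA3]";
        Console.WriteLine(IsAccepted(working));
    }
}
```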
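Unfortunately, there is no easy solution to your problem, because it is caused by the nature of C# strings. You cannot fit a UTF-32 symbol into a single C# char, and you cannot use a multi-char string as a range boundary.
A possible workaround is a semi-regex solution: extract such symbols from the string and perform the comparison in plain C# code. It certainly looks ugly, but I don't see another way to achieve this with a raw regex in C#.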
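As a rough illustration of that workaround, the comparison can be done per code point instead of per char. This is only a sketch under assumptions: CodePointScan and IndexOfCodePointInRange are hypothetical names, and a real rules engine would still need to combine this with the rest of the pattern.

```csharp
using System;

public static class CodePointScan
{
    // Returns the char index of the first code point that falls within
    // [first, last], or -1 if there is none.
    public static int IndexOfCodePointInRange(string text, int first, int last)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // A valid surrogate pair is decoded into one code point;
            // anything else is compared as a single UTF-16 code unit.
            bool isPair = char.IsSurrogatePair(text, i);
            int codePoint = isPair ? char.ConvertToUtf32(text, i) : text[i];

            if (codePoint >= first && codePoint <= last)
                return i;

            if (isPair)
                i++; // skip the low surrogate of the pair just examined
        }
        return -1;
    }

    public static void Main()
    {
        string input = "ab\uD82F\uDCA2c"; // contains U+1BCA2
        Console.WriteLine(IndexOfCodePointInRange(input, 0x1BCA0, 0x1BCA3));
    }
}
```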
Answer 2 (score: 1)
While the other contributors to this question provided some clues, I needed an answer. My test is a rules engine driven by a regular expression that is built from file input, so hard-coding the logic into C# is not an option.
However, I did learn here that the .NET Regex class doesn't support surrogate pairs, and that ranges of surrogate pairs can be emulated with regex alternation.
But of course, in my data-driven case I can't manually change the regular expressions into a format that .NET will accept; I need to automate it. So I created the Utf32Regex class below, which accepts UTF-32 characters directly in its constructor and internally converts them into regular expressions that .NET understands.
For example, it will convert the regex
"[abc\\U00011DEF-\\U00013E07]"
to
"(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"
or
"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"
to
"((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" +
"\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" +
"\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" +
"\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" +
"\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
/// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
/// like <c>\U00010000-\U00010001</c>.
/// </summary>
public class Utf32Regex : Regex
{
    private const char MinLowSurrogate = '\uDC00';
    private const char MaxLowSurrogate = '\uDFFF';
    private const char MinHighSurrogate = '\uD800';
    private const char MaxHighSurrogate = '\uDBFF';

    // Match any character class such as [A-z]
    private static readonly Regex characterClass = new Regex(
        "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
        RegexOptions.Compiled);

    // Match a UTF32 range such as \U000E01F0-\U000E0FFF
    // or an individual character such as \U000E0FFF
    private static readonly Regex utf32Range = new Regex(
        "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
        RegexOptions.Compiled);

    public Utf32Regex()
        : base()
    {
    }

    public Utf32Regex(string pattern)
        : base(ConvertUTF32Characters(pattern))
    {
    }

    public Utf32Regex(string pattern, RegexOptions options)
        : base(ConvertUTF32Characters(pattern), options)
    {
    }

    public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
        : base(ConvertUTF32Characters(pattern), options, matchTimeout)
    {
    }

    private static string ConvertUTF32Characters(string regexString)
    {
        StringBuilder result = new StringBuilder();
        // Convert any UTF32 character ranges \U00000000-\U00FFFFFF to their
        // equivalent UTF16 characters
        ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
        // Now find all of the individual characters that were not in ranges and
        // fix those as well.
        ConvertUTF32CharactersToUTF16(result);
        return result.ToString();
    }

    private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
    {
        Match match = characterClass.Match(regexString);
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                // Local renamed so it does not shadow the static characterClass field
                string characterClassText = match.Groups[1].Value;
                string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(characterClassText);
                result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Copy text before the match
                result.Append(convertedCharacterClass); // Append replacement
                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(regexString.Substring(lastEnd)); // Append tail
    }

    private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
    {
        StringBuilder result = new StringBuilder();
        StringBuilder chars = new StringBuilder();
        Match match = utf32Range.Match(characterClass);
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string utf16Chars;
                string rangeBegin = match.Groups["begin"].Value.Substring(2);
                if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                {
                    string rangeEnd = match.Groups["end"].Value.Substring(2);
                    utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                }
                else
                {
                    utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                }
                result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Copy text before the match
                chars.Append(utf16Chars); // Collect replacement
                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

        // Special case - if we have removed all of the contents of the
        // character class, we need to remove the square brackets and the
        // alternation character |
        // (StringBuilder has no IndexOf method, so search a string snapshot.)
        int emptyCharClass = result.ToString().IndexOf("[]", StringComparison.Ordinal);
        if (emptyCharClass >= 0)
        {
            result.Remove(emptyCharClass, 2);
            // Append replacement ranges (exclude beginning |)
            result.Append(chars.ToString(1, chars.Length - 1));
        }
        else
        {
            // Append replacement ranges
            result.Append(chars.ToString());
        }

        if (chars.Length > 0)
        {
            // Wrap both the character class and any UTF16 character alternation into
            // a non-capturing group.
            return "(?:" + result.ToString() + ")";
        }
        return result.ToString();
    }

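    private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
    {
        while (true)
        {
            // StringBuilder has no IndexOf method, so search a string snapshot.
            int where = result.ToString().IndexOf("\\U00", StringComparison.Ordinal);
            if (where < 0)
            {
                break;
            }
            int codePoint = int.Parse(result.ToString(where + 2, 8), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            // Outside a character class there is nothing to alternate with, so emit
            // the surrogate pair in a non-capturing group instead of prefixing it with |.
            StringBuilder replacement = new StringBuilder("(?:");
            AppendUTF16CodePoint(replacement, codePoint);
            replacement.Append(')');
            // Replace the 10 characters of \UXXXXXXXX with the UTF16 form.
            result.Remove(where, 10);
            result.Insert(where, replacement.ToString());
        }
    }
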
    private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
    {
        var result = new StringBuilder();
        int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
        int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);
        var beginChars = char.ConvertFromUtf32(beginCodePoint);
        var endChars = char.ConvertFromUtf32(endCodePoint);
        int beginDiff = endChars[0] - beginChars[0];
        if (beginDiff == 0)
        {
            // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        else
        {
            // If the begin character is not the same, create 3 ranges
            // 1. The remainder of the first
            // 2. A range of all of the middle characters
            // 3. The beginning of the last
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, MaxLowSurrogate);
            result.Append(']');

            // We only need a middle range if the ranges are not adjacent
            if (beginDiff > 1)
            {
                result.Append("|");
                // We only need a character class if there is more than 1
                // character in the middle range
                if (beginDiff > 2)
                {
                    result.Append('[');
                }
                AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                if (beginDiff > 2)
                {
                    result.Append('-');
                    AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                    result.Append(']');
                }
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
            }

            result.Append("|");
            AppendUTF16Character(result, endChars[0]);
            result.Append('[');
            AppendUTF16Character(result, MinLowSurrogate);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        return result.ToString();
    }

    private static string UTF32ToUTF16Chars(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return UTF32ToUTF16Chars(codePoint);
    }

    private static string UTF32ToUTF16Chars(int codePoint)
    {
        StringBuilder result = new StringBuilder();
        UTF32ToUTF16Chars(codePoint, result);
        return result.ToString();
    }

    private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
    {
        // Use regex alternation on the entire range of UTF32 code points
        // to ensure each one is treated as a group.
        result.Append("|");
        AppendUTF16CodePoint(result, codePoint);
    }

    private static void AppendUTF16CodePoint(StringBuilder text, int cp)
    {
        var chars = char.ConvertFromUtf32(cp);
        AppendUTF16Character(text, chars[0]);
        if (chars.Length == 2)
        {
            AppendUTF16Character(text, chars[1]);
        }
    }

    private static void AppendUTF16Character(StringBuilder text, char c)
    {
        text.Append(@"\u");
        // Always emit exactly 4 hex digits, as the \u escape requires.
        text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
    }
}
Note that this is not well tested and may not be very robust, but for testing purposes it should be fine.
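Since the class is only lightly tested, one way to build some confidence is to check a converted pattern against code points at the edges of the intended range. The sketch below is an assumption-laden illustration, not part of the answer's class: ConvertedPatternCheck and Matches are hypothetical names, and the pattern is the converted form quoted above for [abc\U00011DEF-\U00013E07], anchored with ^ and $ so that only whole-string matches count.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConvertedPatternCheck
{
    // Converted form of [abc\U00011DEF-\U00013E07]; the \uXXXX escapes are
    // left for the regex engine to interpret (verbatim string).
    private static readonly Regex Converted = new Regex(
        @"^(?:[abc]|\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])$");

    // True if the single code point, encoded as UTF-16, matches the pattern.
    public static bool Matches(int codePoint)
    {
        return Converted.IsMatch(char.ConvertFromUtf32(codePoint));
    }

    public static void Main()
    {
        // One BMP member, then values just outside, on, and inside the range bounds.
        foreach (int cp in new int[] { 'a', 0x11DEE, 0x11DEF, 0x12000, 0x13E07, 0x13E08 })
        {
            Console.WriteLine($"U+{cp:X5}: {Matches(cp)}");
        }
    }
}
```

Checking both bounds plus one value on either side exercises all three alternation branches that the range is split into.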