Regex not working with Unicode character ranges

Date: 2017-12-02 05:45:46

Tags: c# .net regex unicode

  

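Note

Another question, C# Regular Expressions with \Uxxxxxxxx characters in the pattern, has already been asked. This question differs in that it is not about how to calculate surrogate pairs, but about how to express Unicode planes above 0 in a regular expression. It should be clear from my question that I already understand why these code points are represented as 2 characters: they are surrogate pairs (which is what the other question asked about). My question is how to convert them generically (since I have no control over the regular expressions being fed to the program) so that they can be consumed by the .NET Regex engine.

Note that I now have a way to do this and would like to add my answer to my question, but since it is now marked as a duplicate I cannot add my answer.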

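I have some test data that is passed to a Java library I am porting to C#. I have isolated one particular problem case as an example. The character class in the original is expressed in UTF-32 as \U0001BCA0-\U0001BCA3, which .NET cannot readily consume - we get an "Unrecognized escape sequence \U" error.

I tried converting to UTF-16, and I have confirmed that the results for \U0001BCA0 and \U0001BCA3 are what is expected.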

UTF-32      | Codepoint   | High Surrogate  | Low Surrogate  | UTF-16
---------------------------------------------------------------------------
0x0001BCA0  | 113824      | 55343           | 56480          | \uD82F\uDCA0
0x0001BCA3  | 113827      | 55343           | 56483          | \uD82F\uDCA3
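The surrogate values in the table can be reproduced with the standard UTF-16 arithmetic. The following is a minimal sketch; the class name `SurrogateMath` and the helper `ToSurrogates` are illustrative names that do not appear in the original post, and the result is cross-checked against the framework's `char.ConvertFromUtf32`.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        foreach (int cp in new[] { 0x1BCA0, 0x1BCA3 })
        {
            var (high, low) = ToSurrogates(cp);
            // Cross-check against the framework's own conversion.
            string viaFramework = char.ConvertFromUtf32(cp);
            bool same = viaFramework[0] == high && viaFramework[1] == low;
            Console.WriteLine($"U+{cp:X5} -> \\u{(int)high:X4}\\u{(int)low:X4} (matches framework: {same})");
        }
    }
}
```

For U+1BCA0 the offset is 0xBCA0, so the high surrogate is 0xD800 + 0x2F = 0xD82F and the low surrogate is 0xDC00 + 0xA0 = 0xDCA0, which agrees with the first row of the table.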

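However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get the exception "[x-y] range in reverse order".

Although it is clear that the characters are specified in the correct order (it works in Java), I tried them in reverse and got the same error message.

I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but still get the exception "[x-y] range in reverse order".

So, how do I get the .NET Regex class to successfully parse this character range?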

  

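NOTE: I tried changing the code to generate a regex character class that lists every character instead of a range, and it seems to work, but that turns my regexes from a few dozen characters into several thousand characters, which certainly won't do wonders for performance.

Actual Regex Example

Again, the above is an isolated example of a failure within a larger string. What I am looking for is a general way to convert regexes like this one so that they can be parsed by the .NET Regex class.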

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

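3 Answers:

Answer 0: (score: 3)

You are assuming that Regex will recognize "\uD82F\uDCA0" as a single composite character. That is not the case, because the internal representation of strings in .NET is 16-bit Unicode (UTF-16).

Unicode has the concept of code points, which is an abstract notion independent of the physical representation. Depending on the actual encoding used, not every code point fits into a single character. In UTF-8 this becomes very apparent, since all code points above 127 require two or more bytes. In .NET, a character is a 16-bit UTF-16 code unit, which means that for planes above 0 you need a combination of two characters (a surrogate pair). The regex engine still treats each half of such a pair as a separate character.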
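To make the distinction between code units and code points concrete, here is a small sketch. The class `CodeUnitDemo` and the helper `CountCodePoints` are illustrative names, not part of the original answer; only standard `char` methods are used.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: counts code points, treating each well-formed
    // surrogate pair as one code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "\U0001BCA0"; // one code point outside plane 0

        Console.WriteLine(s.Length);           // number of UTF-16 code units
        Console.WriteLine(CountCodePoints(s)); // number of code points
        Console.WriteLine(char.IsHighSurrogate(s[0]) && char.IsLowSurrogate(s[1]));
    }
}
```

`string.Length` counts UTF-16 code units, so the single code point U+1BCA0 has a length of 2, and the Regex class works on those same units.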

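Long story short: don't think of the character combination as a code point; treat its two halves as individual characters. So in your case the regex would be: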

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
        Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
    }
}

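You can try out the code here.

Answer 1: (score: 1)

Strings in C# are UTF-16 encoded. That is why the content of this character class is read as:

  • the symbol '\uD82F'
  • the range \uDCA0-\uD82F
  • the symbol '\uDCA3'

The range \uDCA0-\uD82F is obviously invalid, because its start is greater than its end, and that causes the "[x-y] range in reverse order" exception.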
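The breakdown above can be checked directly. This sketch dumps the UTF-16 code units of the failing character class and tries to compile it; `ReverseRangeDemo` and `TryCompile` are illustrative names. It assumes the parse failure surfaces as an `ArgumentException` (or a subclass of it), which is how the Regex constructor reports invalid patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseRangeDemo
{
    // Hypothetical helper: returns true when the Regex constructor
    // accepts the pattern, false when it rejects it.
    public static bool TryCompile(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string failing = "[\uD82F\uDCA0-\uD82F\uDCA3]";

        // Print each 16-bit code unit the regex parser actually sees.
        foreach (char c in failing)
        {
            Console.Write($"{(int)c:X4} ");
        }
        Console.WriteLine();

        // The range inside the class runs from DCA0 down to D82F.
        Console.WriteLine(TryCompile(failing));

        // Keeping the shared high surrogate outside the class gives a valid range.
        Console.WriteLine(TryCompile("\uD82F[\uDCA0-\uDCA3]"));
    }
}
```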

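Unfortunately, there is no easy solution to your problem, because it is caused by the nature of C# strings. You cannot put a UTF-32 symbol into a single C# char, and you cannot use multi-character strings as range boundaries.

A possible workaround is a semi-regex solution: extract such symbols from the string and perform the comparison in plain C# code. It looks ugly, of course, but I don't see another way to achieve this with a bare regex in C#.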
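One way the plain C# comparison might look is sketched below. The names `CodePointRange` and `FindInRange` are hypothetical and not from the answer; the idea is simply to decode each code point with `char.ConvertToUtf32` and compare it numerically against the range bounds.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointRange
{
    // Hypothetical helper: yields the string index of every code point in
    // 'input' whose value lies within [first, last] (both inclusive).
    public static IEnumerable<int> FindInRange(string input, int first, int last)
    {
        for (int i = 0; i < input.Length; i++)
        {
            // Only decode as a pair when a well-formed surrogate pair starts here;
            // lone surrogates are compared by their raw code unit value.
            bool isPair = char.IsSurrogatePair(input, i);
            int codePoint = isPair ? char.ConvertToUtf32(input, i) : input[i];

            if (codePoint >= first && codePoint <= last)
            {
                yield return i;
            }
            if (isPair)
            {
                i++; // skip the low surrogate
            }
        }
    }

    public static void Main()
    {
        string text = "a\U0001BCA2b\U0001BCA4";
        foreach (int index in FindInRange(text, 0x1BCA0, 0x1BCA3))
        {
            Console.WriteLine($"code point in range at index {index}");
        }
    }
}
```

In the sample string only U+1BCA2 (at index 1) falls inside U+1BCA0..U+1BCA3; U+1BCA4 lies just outside the range.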

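Answer 2: (score: 1)

While the other contributors to this question provided some clues, I needed an answer. My test is a rule engine driven by regular expressions that are built from file input, so hard-coding the logic into C# is not an option.

However, I did learn here that

  1. the .NET Regex class doesn't support surrogate pairs, and
  2. you can fake support for surrogate pair ranges by using regex alternation.
  3. But of course, in my data-driven case I can't manually change the regular expressions into a format that .NET will accept - I need to automate it. So I created the Utf32Regex class below, which accepts UTF-32 characters directly in the constructor and internally converts them into regular expressions that .NET understands.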

    For example, it will convert the regex

    "[abc\\U00011DEF-\\U00013E07]"

    to

    "(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"

    or

    "([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
    "\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
    "\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
    "\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
    "| [\\u000D] | [\\u000A]) ()"

    to

    "((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" + 
    "\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" + 
    "\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" + 
    "\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" + 
    "\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

    Utf32Regex.cs

    using System;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;
    
    /// <summary>
    /// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
    /// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
    /// like <c>\U00010000-\U00010001</c>.
    /// </summary>
    public class Utf32Regex : Regex
    {
        private const char MinLowSurrogate = '\uDC00';
        private const char MaxLowSurrogate = '\uDFFF';
    
        private const char MinHighSurrogate = '\uD800';
        private const char MaxHighSurrogate = '\uDBFF';
    
        // Match any character class such as [A-z]
        private static readonly Regex characterClass = new Regex(
            "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
            RegexOptions.Compiled);
    
        // Match a UTF32 range such as \U000E01F0-\U000E0FFF
        // or an individual character such as \U000E0FFF
        private static readonly Regex utf32Range = new Regex(
            "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
            RegexOptions.Compiled);
    
        public Utf32Regex()
            : base()
        {
        }
    
        public Utf32Regex(string pattern)
            : base(ConvertUTF32Characters(pattern))
        {
        }
    
        public Utf32Regex(string pattern, RegexOptions options)
            : base(ConvertUTF32Characters(pattern), options)
        {
        }
    
        public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
            : base(ConvertUTF32Characters(pattern), options, matchTimeout)
        {
        }
    
        private static string ConvertUTF32Characters(string regexString)
        {
            StringBuilder result = new StringBuilder();
            // Convert any UTF32 character ranges \U00000000-\U00FFFFFF to their
            // equivalent UTF16 characters
            ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
            // Now find all of the individual characters that were not in ranges and
            // fix those as well.
            ConvertUTF32CharactersToUTF16(result);
    
            return result.ToString();
        }
    
        private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
        {
            Match match = characterClass.Match(regexString); // Reset
            int lastEnd = 0;
            if (match.Success)
            {
                do
                {
                    string characterClass = match.Groups[1].Value;
                    string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(characterClass);
    
                    result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                    result.Append(convertedCharacterClass); // Append replacement 
    
                    lastEnd = match.Index + match.Length;
                } while ((match = match.NextMatch()).Success);
            }
            result.Append(regexString.Substring(lastEnd)); // Append tail
        }
    
        private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
        {
            StringBuilder result = new StringBuilder();
            StringBuilder chars = new StringBuilder();
    
            Match match = utf32Range.Match(characterClass); // Reset
            int lastEnd = 0;
            if (match.Success)
            {
                do
                {
                    string utf16Chars;
                    string rangeBegin = match.Groups["begin"].Value.Substring(2);
    
                    if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                    {
                        string rangeEnd = match.Groups["end"].Value.Substring(2);
                        utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                    }
                    else
                    {
                        utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                    }
    
                    result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                    chars.Append(utf16Chars); // Append replacement 
    
                    lastEnd = match.Index + match.Length;
                } while ((match = match.NextMatch()).Success);
            }
            result.Append(characterClass.Substring(lastEnd)); // Append tail of character class
    
            // Special case - if we have removed all of the contents of the
            // character class, we need to remove the square brackets and the
            // alternation character |
            int emptyCharClass = result.IndexOf("[]");
            if (emptyCharClass >= 0)
            {
                result.Remove(emptyCharClass, 2);
                // Append replacement ranges (exclude beginning |)
                result.Append(chars.ToString(1, chars.Length - 1));
            }
            else
            {
                // Append replacement ranges
                result.Append(chars.ToString());
            }
    
            if (chars.Length > 0)
            {
                // Wrap both the character class and any UTF16 character alternation into
                // a non-capturing group.
                return "(?:" + result.ToString() + ")";
            }
            return result.ToString();
        }
    
        private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
        {
            while (true)
            {
                int where = result.IndexOf("\\U00");
                if (where < 0)
                {
                    break;
                }
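                // Emit the bare surrogate pair in a non-capturing group here: a leading '|'
                // would turn a standalone escape outside a character class into an
                // unintended alternation.
                var pair = new StringBuilder("(?:");
                AppendUTF16CodePoint(pair, int.Parse(result.ToString(where + 2, 8), NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                result.Replace(where, where + 10, pair.Append(')').ToString());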
            }
        }
    
        private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
        {
            var result = new StringBuilder();
            int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
            int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);
    
            var beginChars = char.ConvertFromUtf32(beginCodePoint);
            var endChars = char.ConvertFromUtf32(endCodePoint);
            int beginDiff = endChars[0] - beginChars[0];
    
            if (beginDiff == 0)
            {
                // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
                result.Append("|");
                AppendUTF16Character(result, beginChars[0]);
                result.Append('[');
                AppendUTF16Character(result, beginChars[1]);
                result.Append('-');
                AppendUTF16Character(result, endChars[1]);
                result.Append(']');
            }
            else
            {
                // If the begin character is not the same, create 3 ranges
                // 1. The remainder of the first
                // 2. A range of all of the middle characters
                // 3. The beginning of the last
    
                result.Append("|");
                AppendUTF16Character(result, beginChars[0]);
                result.Append('[');
                AppendUTF16Character(result, beginChars[1]);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
    
                // We only need a middle range if the ranges are not adjacent
                if (beginDiff > 1)
                {
                    result.Append("|");
                    // We only need a character class if there are more than 1
                    // characters in the middle range
                    if (beginDiff > 2)
                    {
                        result.Append('[');
                    }
                    AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                    if (beginDiff > 2)
                    {
                        result.Append('-');
                        AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                        result.Append(']');
                    }
                    result.Append('[');
                    AppendUTF16Character(result, MinLowSurrogate);
                    result.Append('-');
                    AppendUTF16Character(result, MaxLowSurrogate);
                    result.Append(']');
                }
    
                result.Append("|");
                AppendUTF16Character(result, endChars[0]);
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, endChars[1]);
                result.Append(']');
            }
            return result.ToString();
        }
    
        private static string UTF32ToUTF16Chars(string hex)
        {
            int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            return UTF32ToUTF16Chars(codePoint);
        }
    
        private static string UTF32ToUTF16Chars(int codePoint)
        {
            StringBuilder result = new StringBuilder();
            UTF32ToUTF16Chars(codePoint, result);
            return result.ToString();
        }
    
        private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
        {
            // Prefix the code point with the regex alternation operator so that it
            // is emitted as its own alternative within the enclosing group.
            result.Append("|");
            AppendUTF16CodePoint(result, codePoint);
        }
    
        private static void AppendUTF16CodePoint(StringBuilder text, int cp)
        {
            var chars = char.ConvertFromUtf32(cp);
            AppendUTF16Character(text, chars[0]);
            if (chars.Length == 2)
            {
                AppendUTF16Character(text, chars[1]);
            }
        }
    
        private static void AppendUTF16Character(StringBuilder text, char c)
        {
            text.Append(@"\u");
            // Always emit exactly 4 hex digits, as the \u escape requires.
            text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
        }
    }
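The listing above calls `IndexOf(string)` and `Replace(int, int, string)` on a `StringBuilder`. Neither is a built-in `StringBuilder` member, so they have to be supplied as extension methods, and no such helper code is reproduced here. The following is only a sketch of what those helpers might look like: the method bodies are an assumption, and the end-exclusive, Java-style semantics of `Replace` are inferred from the call `result.Replace(where, where + 10, cp)`, which replaces the 10 characters of a `\UXXXXXXXX` escape.

```csharp
using System;
using System.Text;

public static class StringBuilderExtensions
{
    // Assumed behavior: index of the first ordinal occurrence of 'value', or -1.
    public static int IndexOf(this StringBuilder sb, string value)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (value.Length == 0) return 0;

        for (int i = 0; i <= sb.Length - value.Length; i++)
        {
            int j = 0;
            while (j < value.Length && sb[i + j] == value[j])
            {
                j++;
            }
            if (j == value.Length) return i;
        }
        return -1;
    }

    // Assumed behavior: replaces the characters in [start, end) with
    // 'replacement' (end index exclusive, clamped to the current length).
    public static StringBuilder Replace(this StringBuilder sb, int start, int end, string replacement)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));
        if (start < 0 || start > sb.Length || end < start)
            throw new ArgumentOutOfRangeException(nameof(start));

        end = Math.Min(end, sb.Length);
        sb.Remove(start, end - start);
        sb.Insert(start, replacement);
        return sb;
    }

    public static void Main()
    {
        var sb = new StringBuilder("x\\U0001BCA0y");
        int where = sb.IndexOf("\\U00");
        sb.Replace(where, where + 10, "<pair>");
        Console.WriteLine(sb); // prints x<pair>y
    }
}
```

Because no instance overload of `StringBuilder.Replace` takes `(int, int, string)`, the compiler falls back to the extension method for that call.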
    

    Note that this isn't very well tested and probably isn't very robust, but for testing purposes it should be fine.