NUnit - How to compare strings containing composite Unicode characters?

Asked: 2012-02-27 08:27:07

Tags: string unicode localization nunit

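I am using NUnit v2.5 to compare strings that contain composite Unicode characters. The comparison itself works fine, but the caret that marks the first difference appears to be in the wrong place.

UPD: I ended up overriding EqualConstraint so that it in turn calls a custom TextMessageWriter, so I no longer need an answer. See the solution below.

Here is the snippet: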

string s1 = "ใช้งานง่าย";
string s2 = "ใช้งานงาย";
Assert.That(s1, Is.EqualTo(s2));

Here is the output:

Expected: "ใช้งานงาย"
But was:  "ใช้งานง่าย"
------------------^

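The arrow that marks the first differing character appears to be off by 2 positions (there are several diacritics above the letters). For longer strings this becomes really painful. I tried String.Normalize(), but that did not help either.

How can I overcome this? Thanks for your help. See the answer below.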

3 Answers:

Answer 0: (score: 1)

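When comparing Unicode strings, you must always normalize both sides of the comparison, and in the same way. A binary comparison of s1 and s2 is not good enough, because strings that are canonically equivalent will not necessarily test as binary-equal.

Assuming there are four trivial normalization functions, one for each of the four normalization forms, you probably want to test NFD(s1) for binary equality with NFD(s2). It does not matter whether you use NFD or NFC there, but you must do the same thing to both strings.

As for the k-compat functions, NFKC and NFKD, these are useful for string searches because they improve recall at some cost in precision. For example, NFKD("™") will be equal to NFKD("TM"). This is what Adobe Reader does, for example, when you search a document: it always runs the search in k-compat mode so that your search is more likely to find things. However, unlike NFC and NFD, the k-compat functions NFKC and NFKD lose information and are not reversible. With plain NFD and NFC, on the other hand, you can always get back to the other form.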
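
As a minimal sketch of the advice above, assuming .NET's built-in string.Normalize and System.Text.NormalizationForm stand in for the NFD/NFKD functions (the helper name CanonicallyEqual is made up for illustration):

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: ordinal equality after normalizing both sides to the same form.
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormD),
            b.Normalize(NormalizationForm.FormD),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string precomposed = "\u00FC";  // "ü" as a single code point
        string decomposed = "u\u0308";  // "u" followed by COMBINING DIAERESIS

        // A plain binary comparison sees two different strings.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False

        // Normalizing both sides the same way makes them compare equal.
        Console.WriteLine(CanonicallyEqual(precomposed, decomposed)); // True

        // The compatibility form folds the trademark sign into plain letters (lossy).
        Console.WriteLine("\u2122".Normalize(NormalizationForm.FormKD) == "TM"); // True
    }
}
```

In an NUnit test this would amount to normalizing both the expected and the actual value before passing them to Assert.That.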

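Answer 1: (score: 0)

You should be able to use the code from this answer to convert each string into an escaped version of the original. Precomposed characters will become a single escaped \u code point, while combining characters will become a sequence of such escapes. Then run the Assert on those escaped versions of the strings.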
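
The linked code is not reproduced here, so the following is only a rough sketch of the idea under my own assumptions: EscapeNonAscii is a hypothetical helper that writes every char outside printable ASCII as a \uXXXX escape.

```csharp
using System;
using System.Text;

static class EscapeDemo
{
    // Hypothetical helper, not the code from the linked answer:
    // keeps printable ASCII as-is and escapes every other char as \uXXXX.
    public static string EscapeNonAscii(string s)
    {
        var sb = new StringBuilder(s.Length * 6);
        foreach (char c in s)
        {
            if (c >= ' ' && c <= '~')
                sb.Append(c);
            else
                sb.Append("\\u").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(EscapeNonAscii("\u00FC"));   // prints \u00FC
        Console.WriteLine(EscapeNonAscii("u\u0308"));  // prints u\u0308
    }
}
```

Because the escaped strings contain only ASCII, every char occupies exactly one column, so the caret should line up with the first differing escape.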

Answer 2: (score: 0)

I guess I won't find a better answer, so I'm answering my own question.

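Cause.
Many languages represent characters using non-spacing modifiers. For European languages there are precomposed alternatives, e.g. "u" (U+0075) + combining diaeresis (U+0308) = "ü" (U+00FC). In that case @tchrist's solution is sufficient.

For complex writing systems, however, there are no precomposed alternatives to non-spacing modifiers. NUnit's TextMessageWriter.WriteCaretLine(int mismatch) treats the mismatch parameter as a raw offset into the string (an index in char units rather than in displayed columns), whereas the on-screen representation of a Thai string can be shorter than the caret line ("-----^") built from that offset.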
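
A small sketch of why String.Normalize() does not help here, assuming .NET's standard normalization tables: composing to NFC shortens the European sequence, but leaves a Thai consonant plus tone mark as two chars.

```csharp
using System;
using System.Text;

static class ThaiNormalizationDemo
{
    static void Main()
    {
        // "u" + COMBINING DIAERESIS composes into the single code point U+00FC.
        Console.WriteLine("u\u0308".Normalize(NormalizationForm.FormC).Length); // 1

        // CHO CHANG + MAI THO: there is no precomposed form, so NFC keeps both chars.
        string thai = "\u0E0A\u0E49";
        Console.WriteLine(thai.Normalize(NormalizationForm.FormC).Length);      // 2
    }
}
```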

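Solution.
Make WriteCaretLine(int mismatch) respect non-spacing modifiers by reducing the mismatch value by the number of non-spacing modifiers that occur before that offset.
Implement all the supplementary classes that are actually needed, just so that the new code gets called.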
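
The core of the adjustment can be sketched in isolation as follows. ToDisplayColumn is a hypothetical name used only for this illustration; it counts the chars before the mismatch offset that are not non-spacing marks.

```csharp
using System;
using System.Globalization;

static class CaretDemo
{
    // Hypothetical helper: number of spacing (column-occupying) chars before charIndex.
    public static int ToDisplayColumn(string s, int charIndex)
    {
        int column = 0;
        for (int i = 0; i < charIndex && i < s.Length; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark)
                column++;
        }
        return column;
    }

    static void Main()
    {
        // The "But was" string from the question, written with escapes.
        string actual = "\u0E43\u0E0A\u0E49\u0E07\u0E32\u0E19\u0E07\u0E48\u0E32\u0E22";

        // The strings first differ at char index 7; one tone mark (index 2) precedes it.
        Console.WriteLine(ToDisplayColumn(actual, 7)); // 6
    }
}
```

This sketch only looks at UnicodeCategory.NonSpacingMark, as the solution does; enclosing marks, surrogate pairs and double-width characters would need extra handling.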

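Besides Thai, I tested it with Devanagari and Tibetan. It works as expected.

One more pitfall. If, like me, you use NUnit in Visual Studio through ReSharper, you have to configure Internet Explorer's fonts (this cannot be managed from R#) so that it uses a proper monospaced font for Thai, Devanagari, etc.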

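Implementation.

  1. Inherit from TextMessageWriter and override its DisplayStringDifferences;
  2. Implement your own ClipExpectedAndActual and FindMismatchPosition - this is where non-spacing modifiers are respected; proper clipping is needed because it can also affect the count of non-spacing elements;
  3. Inherit from EqualConstraint and override its WriteMessageTo(MessageWriter writer) so that your MessageWriter is used;
  4. (Optional) Create a custom wrapper to make invoking the custom constraint simple.

The source code follows. About 80% of the code does nothing useful, but it is included because of the access levels in the original code.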

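    using System;
    using System.Globalization;
    using NUnit.Framework;
    using NUnit.Framework.Constraints;
    
    // Step 1.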
    public class ThaiMessageWriter : TextMessageWriter
    {
        /// <summary>
        /// This method is merely a copy of the original method taken from the NUnit sources,
        /// except that it changes the meaning of <paramref name="mismatch"/> before the caret line is displayed.
        /// </summary>
        /// <remarks>
        /// The <paramref name="mismatch"/> passed in is an offset into the string, while proper display of the caret
        /// requires its position to be calculated in character placeholder units. The two differ for
        /// characters drawn over or under the line, such as an acute accent or marks in complex scripts (Thai).
        /// </remarks>
        public override void DisplayStringDifferences(string expected, string actual, int mismatch, bool ignoreCase, bool clipping)
        {
            // Maximum string we can display without truncating
            int maxDisplayLength = MaxLineLength
                                   - PrefixLength   // Allow for prefix
                                   - 2;             // 2 quotation marks
    
            int mismatchOffset = mismatch;
    
            if (clipping)
                MsgUtils2.ClipExpectedAndActual(ref expected, ref actual, maxDisplayLength, mismatchOffset);
    
            expected = MsgUtils.EscapeControlChars(expected);
            actual = MsgUtils.EscapeControlChars(actual);
    
            // The mismatch position may have changed due to clipping or white space conversion
            int mismatchInCharPlaceholders = MsgUtils2.FindMismatchPosition(expected, actual, 0, ignoreCase);
    
            Write(Pfx_Expected);
            WriteExpectedValue(expected);
            if (ignoreCase)
                WriteModifier("ignoring case");
            WriteLine();
            WriteActualLine(actual);
            //DisplayDifferences(expected, actual);
            if (mismatch >= 0)
                WriteCaretLine(mismatchInCharPlaceholders);
    
        }
    
        // Copied due to private
        /// <summary>
        /// Write the generic 'Actual' line for a constraint
        /// </summary>
        /// <param name="constraint">The constraint for which the actual value is to be written</param>
        private void WriteActualLine(Constraint constraint)
        {
            Write(Pfx_Actual);
            constraint.WriteActualValueTo(this);
            WriteLine();
        }
    
        // Copied due to private
        /// <summary>
        /// Write the generic 'Actual' line for a given value
        /// </summary>
        /// <param name="actual">The actual value causing a failure</param>
        private void WriteActualLine(object actual)
        {
            Write(Pfx_Actual);
            WriteActualValue(actual);
            WriteLine();
        }
    
        // Copied due to private
        private void WriteCaretLine(int mismatch)
        {
            // We subtract 2 for the initial 2 blanks and add back 1 for the initial quote
            WriteLine("  {0}^", new string('-', PrefixLength + mismatch - 2 + 1));
        }
    }
    
    // Step 2.
    public static class MsgUtils2
    {
        private static readonly string ELLIPSIS = "...";
    
        /// <summary>
        ///  Almost a copy of MsgUtil.ClipExpectedAndActual method
        /// </summary>
        /// <param name="expected"></param>
        /// <param name="actual"></param>
        /// <param name="maxDisplayLength"></param>
        /// <param name="mismatch"></param>
        public static void ClipExpectedAndActual(ref string expected, ref string actual, int maxDisplayLength, int mismatch)
        {
            // Case 1: Both strings fit on line
            int maxStringLength = Math.Max(expected.Length, actual.Length);
            if (maxStringLength <= maxDisplayLength)
                return;
    
            // Case 2: Assume that the tail of each string fits on line
            int clipLength = maxDisplayLength - ELLIPSIS.Length;
            int clipStart = maxStringLength - clipLength;
    
            // Case 3: If it doesn't, center the mismatch position
            if (clipStart > mismatch)
                clipStart = Math.Max(0, mismatch - clipLength / 2);
    
            // shift both clipStart and maxDisplayLength if they split non-placeholding symbol
            AdjustForNonPlaceholdingCharacter(expected, ref clipStart);
            AdjustForNonPlaceholdingCharacter(expected, ref maxDisplayLength);
    
            expected = MsgUtils.ClipString(expected, maxDisplayLength, clipStart);
            actual = MsgUtils.ClipString(actual, maxDisplayLength, clipStart);
        }
    
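        private static void AdjustForNonPlaceholdingCharacter(string expected, ref int index)
        {
            // Nothing to adjust when the index lies beyond the end of this string
            // (e.g. when expected is shorter than actual); indexing would throw.
            if (index >= expected.Length)
                return;
    
            // Step back so that the cut never lands on a non-spacing mark.
            while (index > 0 && CharUnicodeInfo.GetUnicodeCategory(expected[index]) == UnicodeCategory.NonSpacingMark)
            {
                index--;
            }
        }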
    
        static public int FindMismatchPosition(string expected, string actual, int istart, bool ignoreCase)
        {
            int length = Math.Min(expected.Length, actual.Length);
    
            string s1 = ignoreCase ? expected.ToLower() : expected;
            string s2 = ignoreCase ? actual.ToLower() : actual;
    
            int iSpacingCharacters = 0;
            for (int i = 0; i < istart; i++)
            {
                if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
                    iSpacingCharacters++;
            }
            for (int i = istart; i < length; i++)
            {
                if (s1[i] != s2[i])
                    return iSpacingCharacters;
                if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
                    iSpacingCharacters++;
            }
    
            //
            // Strings have same content up to the length of the shorter string.
            // Mismatch occurs because string lengths are different, so show
            // that they start differing where the shortest string ends,
            // counted in spacing characters like the other return value.
            //
            if (expected.Length != actual.Length)
                return iSpacingCharacters;
    
            //
            // Same strings : We shouldn't get here
            //
            return -1;
        }
    }
    
    // Step 3.
    public class ThaiEqualConstraint : EqualConstraint
    {
        private readonly string _expected;
    
        // WTF expected is private?
        public ThaiEqualConstraint(string expected) : base(expected)
        {
            _expected = expected;
        }
    
        public override void WriteMessageTo(MessageWriter writer)
        {
            // redirect output to customized MessageWriter
            var myMessageWriter = new ThaiMessageWriter();
            base.WriteMessageTo(myMessageWriter);
            writer.Write(myMessageWriter);
        }
    }
    
    // Step 4.
    public static class ThaiText
    {
        public static EqualConstraint IsEqual(string expected)
        {
            return new ThaiEqualConstraint(expected);
        }
    }