Equality comparison of RTF strings

Time: 2014-09-15 20:10:08

Tags: c# regex rtf string-comparison

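I have a program that intercepts copied data and stores it for later use. Items that are equal, or at least equivalent, to an item already in the list should not be added again. The problem arises with rich text strings.

For my purposes, two strings should be considered equal if they have the same plain-text result and the same formatting. Correct me if I am wrong, but as I understand it, an RSID is embedded whenever an RTF string is copied, and it differs for every copied RTF string. I am currently using a Regex to strip out all RSIDs.

However, copying the same word from Microsoft Word twice gives me two RTF strings that are considered unequal, even after I strip their RSIDs.

Using C#, how can I compare these strings by plain-text content and formatting?

My function currently looks like this: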

private bool HasEquivalentRichText(string richText1, string richText2)
{
    var rsidRegex = new Regex("(rsid[0-9]+)");
    var cleanText1 = rsidRegex.Replace(richText1, string.Empty);
    var cleanText2 = rsidRegex.Replace(richText2, string.Empty);

    return cleanText1.Equals(cleanText2);
}

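1 answer:

Answer 0 (score: 3)

When Word converts a document to RTF, it tries to capture the original document with full fidelity by including a variety of proprietary tokens. One of these is {\*\datastore, and for whatever reason something inside the datastore (perhaps a copy counter?) appears to be modified after each copy operation. (This datastore is reported to be encrypted binary data, and its internal structure does not appear to be documented, so I cannot say exactly why it changes after each paste.)

As long as you do not need to paste the data back into Word, you can strip this proprietary information along with the rsid values: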

    /// <summary>
    /// Remove a group from the incoming RTF string starting with {\groupBeginningControlWord
    /// </summary>
    /// <param name="rtf"></param>
    /// <param name="groupBeginningControlWord"></param>
    /// <returns></returns>
    static string RemoveRtfGroup(string rtf, string groupBeginningControlWord)
    {
        // see http://www.biblioscape.com/rtf15_spec.htm
        string groupBeginning = "{\\" + groupBeginningControlWord;
        int index;
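        // Ordinal search: control words are plain ASCII, so culture-sensitive matching is unwanted.
        while ((index = rtf.IndexOf(groupBeginning, StringComparison.Ordinal)) >= 0)
        {
            int nextIndex = index + groupBeginning.Length;
            int depth = 1;
            // Scan forward until the brace that closes the group is consumed.
            // Note: escaped braces (\{ and \}) and \binN data are not handled here.
            for (; depth > 0 && nextIndex < rtf.Length; nextIndex++)
            {
                if (rtf[nextIndex] == '}')
                    depth--;
                else if (rtf[nextIndex] == '{')
                    depth++;
            }
            if (depth > 0)
                break; // Unbalanced group: stop instead of finding it again forever.
            // nextIndex is now one past the closing brace.
            rtf = rtf.Remove(index, nextIndex - index);
        }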

        return rtf;
    }

    static string CleanNonFormattingFromRtf(string rtf)
    {
        var rsidRegex = new Regex("(rsid[0-9]+)");

        var cleanText = rsidRegex.Replace(rtf, string.Empty);
        cleanText = RemoveRtfGroup(cleanText, @"*\datastore");
        return cleanText;
    }
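
As a minimal sketch of how this cleaning step could back the equality check from the question, the following self-contained class strips the datastore group and the rsid values before an ordinal comparison. The names RtfCleanDemo, RemoveGroup and Clean are hypothetical, and RemoveGroup is only a compact stand-in for RemoveRtfGroup above; it assumes braces inside the group are balanced and not escaped.

```csharp
using System;
using System.Text.RegularExpressions;

static class RtfCleanDemo
{
    // Compact stand-in for RemoveRtfGroup: deletes every balanced group
    // that begins with groupStart, for example @"{\*\datastore".
    static string RemoveGroup(string rtf, string groupStart)
    {
        int index;
        while ((index = rtf.IndexOf(groupStart, StringComparison.Ordinal)) >= 0)
        {
            int depth = 0;
            int end = index;
            for (; end < rtf.Length; end++)
            {
                if (rtf[end] == '{')
                    depth++;
                else if (rtf[end] == '}' && --depth == 0)
                    break; // end now points at the brace that closes the group
            }
            if (end >= rtf.Length)
                break; // unbalanced group: give up rather than loop forever
            rtf = rtf.Remove(index, end - index + 1);
        }
        return rtf;
    }

    // Removes the datastore group, then every "rsidNNN" run (same regex as above).
    public static string Clean(string rtf)
    {
        rtf = RemoveGroup(rtf, @"{\*\datastore");
        return Regex.Replace(rtf, "rsid[0-9]+", string.Empty);
    }

    public static bool HasEquivalentRichText(string richText1, string richText2)
    {
        return string.Equals(Clean(richText1), Clean(richText2), StringComparison.Ordinal);
    }
}
```

With hand-written input (not real Word output), two strings that differ only in their \insrsidN value and in the datastore payload compare as equivalent, while a change to the visible text does not.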

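This cleaning approach appears to work for the simple test case of copying a single word from a Word document twice.

Update

After some further investigation, it seems you may not be able to reliably determine the equality of RTF strings copied from Word by excising unwanted metadata and comparing the results.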

You did not provide a minimal, complete and verifiable example of a Word document that generates different RTF for identical copy operations, so I used a page from the Microsoft RTF spec as test input.

Given this, I first found it necessary to remove the entire *\rsidtbl group:

    static string CleanNonFormattingFromRtf(string rtf)
    {
        var rsidRegex = new Regex("(rsid[0-9]+)");

        var cleanText = rtf;
        cleanText = RemoveRtfGroup(cleanText, @"*\datastore");
        cleanText = RemoveRtfGroup(cleanText, @"*\rsidtbl");
        cleanText = rsidRegex.Replace(cleanText, string.Empty);
        return cleanText;
    }

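Second, I found that Word inserts cosmetic CRLFs into the RTF for readability, roughly every 255 characters. These are normally ignored when the document is parsed, but a change to the rsidtbl can cause the line breaks to be inserted at different positions! These cosmetic breaks therefore have to be stripped, but not every line break in RTF is cosmetic. Those inside binary sections, and those that serve as delimiters for control words, must be retained. It is therefore necessary to write a basic parser and tokenizer to strip the unnecessary line breaks, for example: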

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Globalization;

public class RtfNormalizer
{
    public RtfNormalizer(string rtf)
    {
        if (rtf == null)
            throw new ArgumentNullException();
        Rtf = rtf;
    }

    public string Rtf { get; private set; }

    public string GetNormalizedString()
    {
        StringBuilder sb = new StringBuilder();
        var tokenizer = new RtfTokenizer(Rtf);

        RtfToken previous = RtfToken.None;
        while (tokenizer.MoveNext())
        {
            previous = AddCurrentToken(tokenizer, sb, previous);
        }

        return sb.ToString();
    }

    private RtfToken AddCurrentToken(RtfTokenizer tokenizer, StringBuilder sb, RtfToken previous)
    {
        var token = tokenizer.Current;
        switch (token.Type)
        {
            case RtfTokenType.None:
                break;
            case RtfTokenType.StartGroup:
                AddPushGroup(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.EndGroup:
                AddPopGroup(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.ControlWord:
                AddControlWord(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.ControlSymbol:
                AddControlSymbol(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.IgnoredDelimiter:
                AddIgnoredDelimiter(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.CRLF:
                AddCarriageReturn(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.Content:
                AddContent(tokenizer, token, sb, previous);
                break;
            default:
                Debug.Assert(false, "Unknown token type " + token.ToString());
                break;
        }
        return token;
    }

    private void AddPushGroup(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    private void AddPopGroup(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    const string binPrefix = @"\bin";

    bool IsBinaryToken(RtfToken token, out int binaryLength)
    {
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 209:
        //      Remember that binary data can occur when you’re skipping RTF.
        //      A simple way to skip a group in RTF is to keep a running count of the opening braces the RTF reader 
        //      has encountered in the RTF stream. When the RTF reader sees an opening brace, it increments the count. 
        //      When the reader sees a closing brace, it decrements the count. When the count becomes negative, the end 
        //      of the group was found. Unfortunately, this does not work when the RTF file contains a \binN control; the 
        //      reader must explicitly check each control word found to see if it is a \binN control, and if found, 
        //      skip that many bytes before resuming its scanning for braces.
        if (string.CompareOrdinal(binPrefix, 0, token.Rtf, token.StartIndex, binPrefix.Length) == 0)
        {
            // The second argument is an offset within the token, not an absolute index into token.Rtf.
            if (RtfTokenizer.IsControlWordNumericParameter(token, binPrefix.Length))
            {
                bool ok = int.TryParse(token.Rtf.Substring(token.StartIndex + binPrefix.Length, token.Length - binPrefix.Length),
                    NumberStyles.Integer, CultureInfo.InvariantCulture, 
                    out binaryLength);
                return ok;
            }
        }
        binaryLength = -1;
        return false;
    }

    private void AddControlWord(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // Carriage return, usually ignored.
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 151:
        // RTF writers should not use the carriage return/line feed (CR/LF) combination to break up pictures 
        // in binary format. If they do, the CR/LF combination is treated as literal text and considered part of the picture data.
        AddContent(tokenizer, token, sb, previous);
        int binaryLength;
        if (IsBinaryToken(token, out binaryLength))
        {
            if (tokenizer.MoveFixedLength(binaryLength))
            {
                AddContent(tokenizer, tokenizer.Current, sb, previous);
            }
        }
    }

    private void AddControlSymbol(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    private static bool? CanMergeToControlWord(RtfToken previous, RtfToken next)
    {
        if (previous.Type != RtfTokenType.ControlWord)
            throw new ArgumentException();
        if (next.Type == RtfTokenType.CRLF)
            return null; // Can't tell
        if (next.Type != RtfTokenType.Content)
            return false;
        if (previous.Length < 2)
            return false; // Internal error?
        if (next.Length < 1)
            return null; // Internal error?
        var lastCh = previous.Rtf[previous.StartIndex + previous.Length - 1];
        var nextCh = next.Rtf[next.StartIndex];
        if (RtfTokenizer.IsAsciiLetter(lastCh))
        {
            return RtfTokenizer.IsAsciiLetter(nextCh) || RtfTokenizer.IsAsciiMinus(nextCh) || RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else if (RtfTokenizer.IsAsciiMinus(lastCh))
        {
            return RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else if (RtfTokenizer.IsAsciiDigit(lastCh))
        {
            return RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else
        {
            Debug.Assert(false, "unknown final character for control word token \"" + previous.ToString() + "\"");
            return false;
        }
    }

    bool IgnoredDelimiterIsRequired(RtfTokenizer tokenizer, RtfToken token, RtfToken previous)
    {
        // Word inserts required delimiters when required, and optional delimiters for beautification 
        // and readability.  Strip the optional delimiters while retaining the required ones.
        if (previous.Type != RtfTokenType.ControlWord)
            return false;
        var current = tokenizer.Current;
        try
        {
            while (tokenizer.MoveNext())
            {
                var next = tokenizer.Current;
                var canMerge = CanMergeToControlWord(previous, next);
                if (canMerge == null)
                    continue;
                return canMerge.Value;
            }
        }
        finally
        {
            tokenizer.MoveTo(current);
        }
        return false;
    }

    private void AddIgnoredDelimiter(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 151:
        // an RTF file does not have to contain any carriage return/line feed pairs (CRLFs) and CRLFs should be ignored by RTF readers except that 
        // they can act as control word delimiters. RTF files are more readable when CRLFs occur at major group boundaries.
        //
        // but then later:
        // 
        // If a single space delimits the control word, the space does not appear in the document (it’s ignored). Any characters following the single space delimiter, including any subsequent spaces, 
        // will appear as text or spaces in the document. For this reason, you should use spaces only where necessary. It is recommended to avoid spaces as a means of breaking up RTF syntax to make 
        // it easier to read. You can use paragraph marks (CR, LF, or CRLF) to break up lines without changing the meaning except in destinations that contain \binN. 
        // In this document, a control word that takes a numeric parameter N is written with the N, as shown here for \binN, unless the control word appears with an explicit value. The only exceptions to 
        // this are “toggle” control words like \b (bold toggle), which have only two states. When such a control word has no parameter or has a nonzero parameter, the control word turns the property on. 
        // When such a control word has a parameter of 0, the control word turns the property off. For example, \b turns on bold and \b0 turns off bold. In the definitions of these toggle control words, 
        // the control word names are followed by an asterisk.
        if (IgnoredDelimiterIsRequired(tokenizer, token, previous))
            // There *May* be a need for a delimiter, 
            AddContent(tokenizer, " ", sb, previous);
    }

    private void AddCarriageReturn(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // DO NOTHING.
    }

    private void AddContent(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        sb.Append(token.ToString());
    }

    private void AddContent(RtfTokenizer tokenizer, string content, StringBuilder sb, RtfToken previous)
    {
        sb.Append(content);
    }
}

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

public enum RtfTokenType
{
    None = 0,
    StartGroup,
    EndGroup,
    CRLF,
    ControlWord,
    ControlSymbol,
    IgnoredDelimiter,
    Content,
}

public struct RtfToken : IEquatable<RtfToken>
{
    public static RtfToken None { get { return new RtfToken(); } }

    public RtfToken(RtfTokenType type, int startIndex, int length, string rtf)
        : this()
    {
        this.Type = type;
        this.StartIndex = startIndex;
        this.Length = length;
        this.Rtf = rtf;
    }
    public RtfTokenType Type { get; private set; }

    public int StartIndex { get; private set; }

    public int Length { get; private set; }

    public string Rtf { get; private set; }

    public bool IsEmpty { get { return Rtf == null; } }

    #region IEquatable<RtfToken> Members

    public bool Equals(RtfToken other)
    {
        if (this.Type != other.Type)
            return false;
        if (this.Length != other.Length)
            return false;
        if (this.IsEmpty)
            return other.IsEmpty;
        else 
            return string.CompareOrdinal(this.Rtf, StartIndex, other.Rtf, other.StartIndex, Length) == 0;
    }

    public static bool operator ==(RtfToken first, RtfToken second)
    {
        return first.Equals(second);
    }

    public static bool operator !=(RtfToken first, RtfToken second)
    {
        return !first.Equals(second);
    }
    #endregion

    public override string ToString()
    {
        if (Rtf == null)
            return string.Empty;
        return Rtf.Substring(StartIndex, Length);
    }

    public override bool Equals(object obj)
    {
        if (obj is RtfToken)
            return Equals((RtfToken)obj);
        return false;
    }

    public override int GetHashCode()
    {
        if (Rtf == null)
            return 0;
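        int code = Type.GetHashCode() ^ Length.GetHashCode();
        // Hash exactly the characters that Equals compares: StartIndex .. StartIndex + Length - 1.
        for (int i = StartIndex; i < StartIndex + Length; i++)
            code ^= Rtf[i].GetHashCode();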
        return code;
    }
}

public class RtfTokenizer : IEnumerator<RtfToken> 
{
    public RtfTokenizer(string rtf)
    {
        if (rtf == null)
            throw new ArgumentNullException();
        Rtf = rtf;
    }

    public string Rtf { get; private set; }

#if false
    Rich Text Format (RTF) Specification, Version 1.9.1:
    Control Word
    An RTF control word is a specially formatted command used to mark characters for display on a monitor or characters destined for a printer. A control word’s name cannot be longer than 32 letters. 
    A control word is defined by:
    \<ASCII Letter Sequence><Delimiter>
    where <Delimiter> marks the end of the control word’s name. For example:
    \par
    A backslash begins each control word and the control word is case sensitive.
    The <ASCII Letter Sequence> is made up of ASCII alphabetical characters (a through z and A through Z). Control words (also known as keywords) originally did not contain any uppercase characters, however in recent years uppercase characters appear in some newer control words.
    The <Delimiter> can be one of the following:
    •   A space. This serves only to delimit a control word and is ignored in subsequent processing.
    •   A numeric digit or an ASCII minus sign (-), which indicates that a numeric parameter is associated with the control word. The subsequent digital sequence is then delimited by any character other than an ASCII digit (commonly another control word that begins with a backslash). The parameter can be a positive or negative decimal number. The range of the values for the number is nominally –32768 through 32767, i.e., a signed 16-bit integer. A small number of control words take values in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer). These control words include \binN, \revdttmN, \rsidN related control words and some picture properties like \bliptagN. Here N stands for the numeric parameter. An RTF parser must allow for up to 10 digits optionally preceded by a minus sign. If the delimiter is a space, it is discarded, that is, it’s not included in subsequent processing.
    •   Any character other than a letter or a digit. In this case, the delimiting character terminates the control word and is not part of the control word. Such as a backslash “\”, which means a new control word or a control symbol follows.
    If a single space delimits the control word, the space does not appear in the document (it’s ignored). Any characters following the single space delimiter, including any subsequent spaces, will appear as text or spaces in the document. For this reason, you should use spaces only where necessary. It is recommended to avoid spaces as a means of breaking up RTF syntax to make it easier to read. You can use paragraph marks (CR, LF, or CRLF) to break up lines without changing the meaning except in destinations that contain \binN. 
    In this document, a control word that takes a numeric parameter N is written with the N, as shown here for \binN, unless the control word appears with an explicit value. The only exceptions to this are “toggle” control words like \b (bold toggle), which have only two states. When such a control word has no parameter or has a nonzero parameter, the control word turns the property on. When such a control word has a parameter of 0, the control word turns the property off. For example, \b turns on bold and \b0 turns off bold. In the definitions of these toggle control words, the control word names are followed by an asterisk.
#endif

    public static bool IsAsciiLetter(char ch)
    {
        if (ch >= 'a' && ch <= 'z')
            return true;
        if (ch >= 'A' && ch <= 'Z')
            return true;
        return false;
    }

    public static bool IsAsciiDigit(char ch)
    {
        if (ch >= '0' && ch <= '9')
            return true;
        return false;
    }

    public static bool IsAsciiMinus(char ch)
    {
        return ch == '-';
    }

    public static bool IsControlWordNumericParameter(RtfToken token, int startIndex)
    {
        int inLength = token.Length - startIndex;
        int actualLength;
        if (IsControlWordNumericParameter(token.Rtf, token.StartIndex + startIndex, out actualLength)
            && actualLength == inLength)
        {
            return true;
        }
        return false;
    }

    static bool IsControlWordNumericParameter(string rtf, int startIndex, out int length)
    {
        int index = startIndex;
        if (index < rtf.Length - 1 && IsAsciiMinus(rtf[index]) && IsAsciiDigit(rtf[index + 1]))
            index++;
        for (; index < rtf.Length && IsAsciiDigit(rtf[index]); index++)
            ;
        length = index - startIndex;
        return length > 0;
    }

    static bool IsControlWord(string rtf, int startIndex, out int length)
    {
        int index = startIndex;
        for (; index < rtf.Length && IsAsciiLetter(rtf[index]); index++)
            ;
        length = index - startIndex;
        if (length == 0)
            return false;
        int paramLength;
        if (IsControlWordNumericParameter(rtf, index, out paramLength))
            length += paramLength;
        return true;
    }

    public IEnumerable<RtfToken> AsEnumerable()
    {
        int oldPos = nextPosition;
        RtfToken oldCurrent = current;
        try
        {
            while (MoveNext())
                yield return Current;
        }
        finally
        {
            nextPosition = oldPos;
            current = oldCurrent;
        }
    }

    string RebuildRtf()
    {
        string newRtf = AsEnumerable().Aggregate(new StringBuilder(), (sb, t) => sb.Append(t.ToString())).ToString();
        return newRtf;
    }

    [Conditional("DEBUG")]
    public void AssertValid()
    {
        var newRtf = RebuildRtf();
        if (Rtf != newRtf)
        {
            Debug.Assert(false, "rebuilt rtf mismatch");
        }
    }

    #region IEnumerator<RtfToken> Members

    int nextPosition = 0;
    RtfToken current = new RtfToken();

    public RtfToken Current
    {
        get {
            return current;
        }
    }

    #endregion

    #region IDisposable Members

    public void Dispose()
    {
    }

    #endregion

    #region IEnumerator Members

    object System.Collections.IEnumerator.Current
    {
        get { return Current; }
    }

    public void MoveTo(RtfToken token)
    {
        if (token.Rtf != Rtf)
            throw new ArgumentException();
        nextPosition = token.StartIndex + token.Length;
        current = token;
    }

    public bool MoveFixedLength(int length)
    {
        if (nextPosition >= Rtf.Length)
            return false;
        int actualLength = Math.Min(length, Rtf.Length - nextPosition);
        current = new RtfToken(RtfTokenType.Content, nextPosition, actualLength, Rtf);
        nextPosition += actualLength;
        return true;
    }

    static string crlf = "\r\n";

    static bool IsCRLF(string rtf, int startIndex)
    {
        return string.CompareOrdinal(crlf, 0, rtf, startIndex, crlf.Length) == 0;
    }

    public bool MoveNext()
    {
        // As previously mentioned, the backslash (\) and braces ({ }) have special meaning in RTF. To use these characters as text, precede them with a backslash, as in the control symbols \\, \{, and \}.
        if (nextPosition >= Rtf.Length)
            return false;
        RtfToken next = new RtfToken();

        if (Rtf[nextPosition] == '{')
        {
            next = new RtfToken(RtfTokenType.StartGroup, nextPosition, 1, Rtf);
        }
        else if (Rtf[nextPosition] == '}')
        {
            // End group
            next = new RtfToken(RtfTokenType.EndGroup, nextPosition, 1, Rtf);
        }
        else if (IsCRLF(Rtf, nextPosition))
        {
            if (current.Type == RtfTokenType.ControlWord)
                next = new RtfToken(RtfTokenType.IgnoredDelimiter, nextPosition, crlf.Length, Rtf);
            else
                next = new RtfToken(RtfTokenType.CRLF, nextPosition, crlf.Length, Rtf);
        }
        else if (Rtf[nextPosition] == ' ')
        {
            if (current.Type == RtfTokenType.ControlWord)
                next = new RtfToken(RtfTokenType.IgnoredDelimiter, nextPosition, 1, Rtf);
            else
                next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf);
        }
        else if (Rtf[nextPosition] == '\\')
        {
            if (nextPosition == Rtf.Length - 1)
                next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf); // Junk file?
            else
            {
                int length;
                if (IsControlWord(Rtf, nextPosition + 1, out length))
                {
                    next = new RtfToken(RtfTokenType.ControlWord, nextPosition, length + 1, Rtf);
                }
                else
                {
                    // Control symbol.
                    next = new RtfToken(RtfTokenType.ControlSymbol, nextPosition, 2, Rtf);
                }
            }
        }
        else
        {
            // Content
            next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf);
        }

        if (next.Length == 0)
            throw new Exception("internal error");
        current = next;
        nextPosition = next.StartIndex + next.Length;
        return true;
    }

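    public void Reset()
    {
        nextPosition = 0;
        // Also clear Current, because MoveNext inspects it to classify spaces and CRLFs.
        current = new RtfToken();
    }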

    #endregion
}
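
To illustrate why some delimiters have to survive normalization, here is a small self-contained sketch of the rule that CanMergeToControlWord applies: a space or CRLF after a control word may be dropped only if the next character could not be absorbed into that control word. The class and method names (DelimiterDemo, DelimiterIsRequired) are hypothetical and exist only for this illustration.

```csharp
using System;

static class DelimiterDemo
{
    static bool IsLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static bool IsDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Returns true when deleting the delimiter between controlWord and next
    // would let next be read as part of the control word or its parameter.
    public static bool DelimiterIsRequired(string controlWord, char next)
    {
        if (string.IsNullOrEmpty(controlWord))
            throw new ArgumentException("controlWord must not be empty");

        char last = controlWord[controlWord.Length - 1];
        if (IsLetter(last))
        {
            // "\par" + "H" would become "\parH"; "\b" + "0" would become "\b0".
            return IsLetter(next) || IsDigit(next) || next == '-';
        }
        // The word ends in a numeric parameter: only another digit can extend it.
        return IsDigit(next);
    }
}
```

For example, the break in "\par", CRLF, "Hello" must be kept (as a single space in the normalizer above), because "\parHello" would be a different control word, whereas the break in "\par", CRLF, "\b" can be removed safely.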

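The normalizer eliminates many of the false reports of differences between identical copy operations, but some false reports remain when copying multi-line lists or tables. For some reason, Word simply does not seem to generate identical RTF for seemingly identical copies of long, complex formatting.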

You may need to investigate a different approach, such as pasting the RTF into a RichTextBox and comparing the resulting XAML.
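
A rough sketch of that idea, assuming a WPF project (references to PresentationFramework, PresentationCore and WindowsBase) and a call made from an STA thread: it loads the RTF into a FlowDocument, which is the document type a WPF RichTextBox hosts, and serializes it back out as XAML for comparison. The class name RtfXamlComparer is hypothetical, and whether the XAML is stable across repeated copies from Word has not been verified here.

```csharp
using System;
using System.IO;
using System.Text;
using System.Windows;
using System.Windows.Documents;

static class RtfXamlComparer
{
    // Round-trips RTF through a WPF FlowDocument and returns the XAML text.
    public static string RtfToXaml(string rtf)
    {
        var document = new FlowDocument();

        // Assumption: the RTF is 7-bit ASCII, as clipboard RTF usually is.
        using (var input = new MemoryStream(Encoding.ASCII.GetBytes(rtf)))
        {
            var target = new TextRange(document.ContentStart, document.ContentEnd);
            target.Load(input, DataFormats.Rtf);
        }

        using (var output = new MemoryStream())
        {
            // Take a fresh range so it spans everything that was just loaded.
            var whole = new TextRange(document.ContentStart, document.ContentEnd);
            whole.Save(output, DataFormats.Xaml);
            output.Position = 0;
            using (var reader = new StreamReader(output))
            {
                return reader.ReadToEnd();
            }
        }
    }

    public static bool HasEquivalentRichText(string richText1, string richText2)
    {
        return string.Equals(RtfToXaml(richText1), RtfToXaml(richText2), StringComparison.Ordinal);
    }
}
```

Because the WPF importer discards Word's proprietary metadata and cosmetic line breaks while parsing, this may sidestep the token-level differences described above, at the cost of losing any formatting that FlowDocument cannot represent.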