Question

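I have a string comparison issue that, for the most part, behaves as expected, but it is leaving me with a large number of duplicate DB insertions because my code does not detect some string pairs as duplicates.

I thought I had narrowed it down to a culture issue (Cyrillic characters), which I resolved, but I am now getting 'false negatives' (two apparently equal strings showing up as not equal).

I have looked at the similar questions below and tried the following comparison approaches.

Similar questions that I have checked:

Here is an example of the strings being compared (title and description):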

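feed title: Ellsberg: He's a hero

feed desc: Daniel Ellsberg tells CNN's Don Lemon that NSA leaker Edward Snowden showed courage, has done an enormous service.

db title: Ellsberg: He's a hero

db desc: Daniel Ellsberg tells CNN's Don Lemon that NSA leaker Edward Snowden showed courage,   has done an enormous service.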

My app compares the values fetched from RSS feeds with the values I have in the DB, and should only insert "new" values.

//fetch existing articles from DB for the current feed:
    List<Article> thisFeedArticles = (from ar in entities.Items
                                      where (ar.ItemTypeId == (int)Enums.ItemType.Article) && ar.ParentId == feed.FeedId
                                      && ar.DatePublished > datelimit
                                      select new Article
                                      {
                                           Title = ar.Title, 
                                           Description = ar.Blurb
                                      }).ToList();

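Every one of the comparisons below reports no match for the Ellsberg title/description, i.e. matches through matches6 all have Count() == 0.

(Please excuse the enumerated variable names; they are just for testing.)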

   // comparison methods 
CompareOptions compareOptions = CompareOptions.OrdinalIgnoreCase;
CompareOptions compareOptions2 = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace;
//1
IEnumerable<Article> matches = thisFeedArticles.Where(b =>
    String.Compare(b.Title.Trim().Normalize(), a.Title.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0 &&
    String.Compare(b.Description.Trim().Normalize(), a.Description.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0
    );

//2
IEnumerable<Article> matches2 = thisFeedArticles.Where(b =>
    String.Compare(b.Title, a.Title, CultureInfo.CurrentCulture, compareOptions2) == 0 &&
    String.Compare(b.Description, a.Description, CultureInfo.CurrentCulture, compareOptions2) == 0
    );

//3
IEnumerable<Article> matches3 = thisFeedArticles.Where(b =>
    String.Compare(b.Title, a.Title, StringComparison.OrdinalIgnoreCase) == 0 &&
    String.Compare(b.Description, a.Description, StringComparison.OrdinalIgnoreCase) == 0
    );

//4
IEnumerable<Article> matches4 = thisFeedArticles.Where(b =>
    b.Title.Equals(a.Title, StringComparison.OrdinalIgnoreCase) &&
    b.Description.Equals(a.Description, StringComparison.OrdinalIgnoreCase)
    );

//5
IEnumerable<Article> matches5 = thisFeedArticles.Where(b =>
    b.Title.Trim().Equals(a.Title.Trim(), StringComparison.InvariantCultureIgnoreCase) &&
    b.Description.Trim().Equals(a.Description.Trim(), StringComparison.InvariantCultureIgnoreCase)
    );

//6
IEnumerable<Article> matches6 = thisFeedArticles.Where(b =>
    b.Title.Trim().Normalize().Equals(a.Title.Trim().Normalize(), StringComparison.OrdinalIgnoreCase) &&
    b.Description.Trim().Normalize().Equals(a.Description.Trim().Normalize(), StringComparison.OrdinalIgnoreCase)
    );


    if (matches.Count() == 0 && matches2.Count() == 0 && matches3.Count() == 0 && matches4.Count() == 0 && matches5.Count() == 0 && matches6.Count() == 0 && matches7.Count() == 0)
    {
    //insert values
    }

    //this if statement was the first approach
    //if (!thisFeedArticles.Any(b => b.Title == a.Title && b.Description == a.Description))
    // {
    // insert
    // }

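Obviously I have only been using one of the above options at a time.

For the most part the above options do work and most duplicates are detected, but some duplicates are still slipping through the cracks. I just need to understand what "the cracks" are, so any suggestions would be most welcome.

I even tried converting the strings to byte arrays and comparing those (I deleted that code a while ago, sorry).

The Article object is as follows: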

    public class Article
    {
        public string Title;
        public string Description;
    }

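UPDATE

I have tried normalizing the strings as well as including the IgnoreSymbols CompareOption, and I am still getting a false negative (non-match). What I have noticed, though, is that apostrophes make a consistent appearance in the false non-matches, so I am thinking it might be a case of apostrophe vs. single quote, i.e. ' vs ’ (and the like). But surely IgnoreSymbols should avoid that?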
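One way to test the apostrophe hypothesis is to stop looking at the rendered text and print the code point at the first position where the two strings differ. The following is a minimal sketch, not code from the application: the class name CodePointDump, its method names, and the typographic apostrophe in the sample feed title are all assumptions made for illustration.

```csharp
using System;

static class CodePointDump
{
    // Returns the index of the first UTF-16 code unit where the strings differ,
    // or -1 if the strings are identical.
    public static int FirstDifference(string x, string y)
    {
        int shared = Math.Min(x.Length, y.Length);
        for (int i = 0; i < shared; i++)
        {
            if (x[i] != y[i]) return i;
        }
        return x.Length == y.Length ? -1 : shared;
    }

    // Formats the code unit at the given index as U+XXXX, or "<end>" past the end.
    public static string Describe(string s, int index)
    {
        return index < s.Length ? "U+" + ((int)s[index]).ToString("X4") : "<end>";
    }

    static void Main()
    {
        // Assumed sample values: the feed uses U+2019, the DB uses U+0027.
        string feedTitle = "Ellsberg: He\u2019s a hero";
        string dbTitle = "Ellsberg: He's a hero";

        int at = FirstDifference(feedTitle, dbTitle);
        Console.WriteLine(at);                      // 12
        Console.WriteLine(Describe(feedTitle, at)); // U+2019
        Console.WriteLine(Describe(dbTitle, at));   // U+0027
    }
}
```

Running this on a real pair of "identical" strings shows whether the culprit is an apostrophe, a run of whitespace, or some other character entirely.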

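I found a couple of similar SO posts:
C# string comparison ignoring spaces, carriage return or line breaks
String comparison: InvariantCultureIgnoreCase vs OrdinalIgnoreCase?

Next step: try using a regex to strip whitespace, as per this answer: https://stackoverflow.com/a/4719009/2261245

UPDATE 2

After the 6 comparisons STILL returned no matches, I realised there had to be another factor skewing the results, so I tried the following: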

//7
IEnumerable<Article> matches7 = thisFeedArticles.Where(b =>
    Regex.Replace(b.Title, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Title, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase) &&
    Regex.Replace(b.Description, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Description, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase)
    );

This finds the matches that the others miss!
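One caveat with the pattern "[^0-9a-zA-Z]+" is that it also removes every non-ASCII letter, so two different titles written entirely in Cyrillic would both reduce to an empty string and compare as equal. A possible variation, sketched here under the assumption that letters and digits of any script should be kept, uses the Unicode categories \p{L} and \p{Nd} instead. The ComparisonKey class and its Build method are hypothetical names, not part of the original code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class ComparisonKey
{
    // \p{L} matches any Unicode letter, \p{Nd} any decimal digit;
    // everything else (whitespace, punctuation, symbols) is stripped.
    private static readonly Regex NonAlphanumeric = new Regex(@"[^\p{L}\p{Nd}]+");

    // Builds a key intended only for duplicate detection, never for display.
    public static string Build(string s)
    {
        string normalized = (s ?? string.Empty).Normalize(NormalizationForm.FormKC);
        return NonAlphanumeric.Replace(normalized, "").ToUpperInvariant();
    }

    static void Main()
    {
        // Differences in apostrophe style, spacing and case disappear.
        Console.WriteLine(Build("Ellsberg: He\u2019s a hero") == Build("ellsberg:  He's a hero")); // True

        // Distinct Cyrillic strings stay distinct with the Unicode-aware key...
        Console.WriteLine(Build("Привет, мир") == Build("Пока, мир")); // False

        // ...whereas the ASCII-only pattern reduces both to an empty string.
        Console.WriteLine(
            Regex.Replace("Привет, мир", "[^0-9a-zA-Z]+", "") ==
            Regex.Replace("Пока, мир", "[^0-9a-zA-Z]+", "")); // True
    }
}
```

Note that this key keeps β and ß as distinct letters, so it would not by itself treat those two characters as equal; that pair would need an explicit mapping.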

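The strings below get through all 6 comparisons undetected, but not number 7:

Both a.Title.Trim().Normalize() and a.Title.Trim() return:

“Correction: Identification of a unique TGF-β-dependent molecular and functional signature in microglia”

The value in the DB is:

“Correction: Identification of a unique TGF-β-dependent molecular and functional signature in microglia”

Closer inspection shows that the German 'eszett' character in the DB is different from the one coming from the feed: β vs ß

I would have expected at least one of comparisons 1-6 to pick that up...

Interestingly, after some performance comparisons, the Regex option is by no means the slowest of the seven. Normalize appears to be far more intensive than the Regex! Here are the Stopwatch durations for all seven when thisFeedArticles contains 12077 items: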

Time elapsed: 00:00:00.0000662
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000016
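Durations this small for 12077 items may say more about how the measurement was taken than about the comparisons themselves. LINQ's Where is deferred, so if the Stopwatch wrapped only the statement that builds the query, no strings had been compared when it stopped. The sketch below illustrates that situation; it is a guess, not a reconstruction of the original benchmark, and the TimingSketch class, its BuildQuery method, and the generated string list are stand-ins for the real code and data.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class TimingSketch
{
    // Builds a lazy query; no comparison runs until the result is enumerated.
    public static IEnumerable<string> BuildQuery(List<string> titles, string candidate)
    {
        return titles.Where(t =>
            t.Normalize().Equals(candidate.Normalize(), StringComparison.OrdinalIgnoreCase));
    }

    static void Main()
    {
        // Stand-in data; the real list would be thisFeedArticles.
        List<string> titles = Enumerable.Range(0, 12077).Select(i => "Title " + i).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        IEnumerable<string> query = BuildQuery(titles, "title 12076");
        sw.Stop();
        Console.WriteLine("Query built:    " + sw.Elapsed);

        sw.Restart();
        int count = query.Count(); // enumeration happens here
        sw.Stop();
        Console.WriteLine("Query executed: " + sw.Elapsed + " (matches: " + count + ")");
    }
}
```

To compare the seven options fairly, the call that forces enumeration (Count(), Any() or ToList()) has to be inside the timed region for each of them.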

Answer 1

Unicode strings can be "binary" different even if they are "semantically" the same.

Try normalizing your strings. For more information, see http://msdn.microsoft.com/en-us/library/System.String.Normalize.aspx
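As a small illustration of what normalization does and does not cover, the sketch below compares a precomposed "é" with its decomposed form, and then the Greek beta with the German eszett mentioned in the question. The NormalizeDemo class and its helper are hypothetical names introduced for this example.

```csharp
using System;
using System.Text;

static class NormalizeDemo
{
    // Ordinal comparison after bringing both strings to normalization form C.
    public static bool EqualAfterNormalizing(string x, string y)
    {
        return string.Equals(
            x.Normalize(NormalizationForm.FormC),
            y.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string composed = "caf\u00E9";    // é as a single code point
        string decomposed = "cafe\u0301"; // e followed by a combining acute accent

        // Same rendered text, different code units.
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False
        Console.WriteLine(EqualAfterNormalizing(composed, decomposed));                   // True

        // Greek beta (U+03B2) and eszett (U+00DF) are different characters,
        // so normalization leaves them unequal.
        Console.WriteLine(EqualAfterNormalizing("TGF-\u03B2", "TGF-\u00DF"));             // False
    }
}
```

So normalization handles different encodings of the same character, but look-alike characters such as β and ß, or ' and ’, remain distinct and need separate handling.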

C# String Comparison equates to false
