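I'm using an IEqualityComparer to match "near duplicates" in a database with LINQ to Entities.
With a record set of about 40,000 rows, this query takes about 15 seconds to complete, and I'd like to know whether any structural changes could be made to the code below.
My public method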
    public List<LeadGridViewModel> AllHighlightingDuplicates(int company)
    {
        var results = AllLeads(company)
            .GroupBy(c => c, new CompanyNameIgnoringSpaces())
            .Select(g => new LeadGridViewModel
            {
                LeadId = g.First().LeadId,
                Qty = g.Count(),
                CompanyName = g.Key.CompanyName
            }).OrderByDescending(x => x.Qty).ToList();
        return results;
    }
Private method that fetches the leads
    private char[] delimiters = new[] { ' ', '-', '*', '&', '!' };
    private IEnumerable<LeadGridViewModel> AllLeads(int company)
    {
        var items = (from t1 in db.Leads
                     where
                         t1.Company_ID == company
                     select new LeadGridViewModel
                     {
                         LeadId = t1.Lead_ID,
                         CompanyName = t1.Company_Name,
                     }).ToList();
        foreach (var x in items)
            x.CompanyNameStripped = string.Join("", (x.CompanyName ?? String.Empty).Split(delimiters));
        return items;
    }
My IEqualityComparer
    public class CompanyNameIgnoringSpaces : IEqualityComparer<LeadGridViewModel>
    {
        public bool Equals(LeadGridViewModel x, LeadGridViewModel y)
        {
            var delimiters = new[] {' ', '-', '*', '&', '!'};
            return delimiters.Aggregate(x.CompanyName ?? String.Empty, (c1, c2) => c1.Replace(c2, '\0'))
                == delimiters.Aggregate(y.CompanyName ?? String.Empty, (c1, c2) => c1.Replace(c2, '\0'));
        }
        public int GetHashCode(LeadGridViewModel obj)
        {
            var delimiters = new[] {' ', '-', '*', '&', '!'};
            return delimiters.Aggregate(obj.CompanyName ?? String.Empty, (c1, c2) => c1.Replace(c2, '\0')).GetHashCode();
        }
    }
Answer 0 (score: 2)
You can use Regex.Replace to perform all of the replacements in a single pass:
    public class CompanyNameIgnoringSpaces : IEqualityComparer<LeadGridViewModel>
    {
        // '-' is placed last in the character class so it is a literal hyphen;
        // written as "[ -*&!]" it would define the range from ' ' to '*'.
        static readonly Regex replacer = new Regex("[ *&!-]");
        public bool Equals(LeadGridViewModel x, LeadGridViewModel y)
        {
            return replacer.Replace(x.CompanyName, "")
                == replacer.Replace(y.CompanyName, "");
        }
        public int GetHashCode(LeadGridViewModel obj)
        {
            return replacer.Replace(obj.CompanyName, "").GetHashCode();
        }
    }
This may be faster; try it and see! (Also note that I've skipped the null checks; you'll probably want to put them back in somehow.)
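For what it's worth, here is a minimal sketch of how the null checks might be put back. The class name NullSafeCompanyNameComparer, the Normalize helper, and the trimmed-down LeadGridViewModel are all made up for illustration and are not from the question. Inside a regex character class a '-' between two characters denotes a range, so the sketch puts it last to keep it a literal hyphen. I have not measured whether this is actually faster than the Aggregate version.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, trimmed-down stand-in for the question's view model.
public class LeadGridViewModel
{
    public int LeadId { get; set; }
    public string CompanyName { get; set; }
}

// Hypothetical name; a null-tolerant variant of the regex-based comparer.
public class NullSafeCompanyNameComparer : IEqualityComparer<LeadGridViewModel>
{
    // '-' is last in the class, so it matches a literal hyphen rather than forming a range.
    private static readonly Regex Replacer = new Regex("[ *&!-]", RegexOptions.Compiled);

    // Treats a null lead or a null CompanyName as the empty string, then strips the delimiters.
    private static string Normalize(LeadGridViewModel lead)
    {
        return Replacer.Replace(lead?.CompanyName ?? string.Empty, string.Empty);
    }

    public bool Equals(LeadGridViewModel x, LeadGridViewModel y)
    {
        return string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);
    }

    public int GetHashCode(LeadGridViewModel obj)
    {
        // Hash the same normalized string that Equals compares, so equal items share a hash.
        return Normalize(obj).GetHashCode();
    }
}
```

Note that this removes the delimiter characters outright, whereas the comparer in the question replaces each one with '\0', so names that differ only in how many delimiters they contain will group differently.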
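Answer 1 (score: 2)
One approach is to create a computed column in the DB that holds the company name with the unwanted characters removed.
Then use this column for filtering.
This may slightly degrade the performance of inserts, but it should greatly improve query time.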
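Before changing the schema, the same "precompute the key once" idea can be tried in memory. The sketch below is an assumption-laden illustration, not the asker's code: LeadRow, DuplicateFinder, Strip and GroupByStrippedName are hypothetical names, and the rows are plain objects rather than entities. It strips each name once and groups on the resulting string, so no custom comparer runs per comparison. It does nothing about the time spent fetching rows from the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a lead row; not the question's actual class.
public class LeadRow
{
    public int LeadId { get; set; }
    public string CompanyName { get; set; }
    public string CompanyNameStripped { get; set; }
    public int Qty { get; set; }
}

public static class DuplicateFinder
{
    // Same delimiter set as in the question.
    private static readonly char[] Delimiters = { ' ', '-', '*', '&', '!' };

    // Removes every delimiter character; a null name is treated as empty.
    public static string Strip(string name)
    {
        return string.Concat((name ?? string.Empty).Split(Delimiters));
    }

    // Groups on the stripped string itself, using the default string equality and hashing.
    public static List<LeadRow> GroupByStrippedName(IEnumerable<LeadRow> leads)
    {
        return leads
            .GroupBy(l => Strip(l.CompanyName))
            .Select(g => new LeadRow
            {
                LeadId = g.First().LeadId,          // first lead seen in the group
                CompanyName = g.First().CompanyName,
                CompanyNameStripped = g.Key,
                Qty = g.Count()
            })
            .OrderByDescending(x => x.Qty)
            .ToList();
    }
}
```

With a computed column in place, the grouping key would come from the database instead of from Strip, and the grouping could then be expressed directly in the query.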