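My application has a list of employees. Each employee has a first name and a last name, so I have a list of elements such as:
["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
I want my customers to be able to search for employees by name using a fuzzy search algorithm. For example, if the user enters "Yuma Turmon", the closest element, "Uma Turman", should be returned. I used a Levenshtein distance algorithm, which I found here.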
    static class LevenshteinDistance
    {
        /// <summary>
        /// Compute the distance between two strings.
        /// </summary>
        public static int Compute(string s, string t)
        {
            int n = s.Length;
            int m = t.Length;
            int[,] d = new int[n + 1, m + 1];

            // Step 1
            if (n == 0)
            {
                return m;
            }
            if (m == 0)
            {
                return n;
            }

            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++)
            {
            }
            for (int j = 0; j <= m; d[0, j] = j++)
            {
            }

            // Step 3
            for (int i = 1; i <= n; i++)
            {
                // Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                    // Step 6
                    d[i, j] = Math.Min(
                        Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);
                }
            }

            // Step 7
            return d[n, m];
        }
    }
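I iterate over the list of employee names, compare the user's input (a full name) against each one, and check the distance. If it is less than 3, for example, I return the employee found.
Now I want to let users search with the names reversed. For example, if the user enters "Turmon Uma", it should return "Uma Turman", because the real distance is 1 once the first and last names are swapped. My algorithm currently treats these as very different strings. How can I modify it so that names are found regardless of their order?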
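Answer 0 (score: 2)
You can use LINQ to create a reversed version of each employee name. For example, if you have a list of employees such as
x = ["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
you can write the following code:
    var reversedNames = x.Select(p => $"{p.Split(' ')[1]} {p.Split(' ')[0]}");
It returns the reversed versions:
xReversed = ["Carry Jim", "Turman Uma", "Gates Bill", "Skeet John"]
Then repeat your algorithm against this data.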
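One way to combine the two passes is to take the smaller of the two distances for each employee. The following is only a sketch under assumptions: `ReversedNameSearch`, `ReverseWords` and `OrderInsensitiveDistance` are hypothetical names that do not appear in the answer, and the `distance` parameter stands for whatever metric you already use, such as `LevenshteinDistance.Compute` from the question.

```csharp
using System;

static class ReversedNameSearch
{
    // Reverses the order of the words in a name: "Uma Turman" -> "Turman Uma".
    // Unlike indexing [1] and [0], this also copes with one-word or three-word names.
    public static string ReverseWords(string fullName)
    {
        string[] parts = fullName.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Array.Reverse(parts);
        return string.Join(" ", parts);
    }

    // Returns the smaller of the distance to the name as stored and the
    // distance to its reversed form, so word order no longer matters
    // for two-word names.
    public static int OrderInsensitiveDistance(
        string query, string name, Func<string, string, int> distance)
    {
        return Math.Min(distance(query, name), distance(query, ReverseWords(name)));
    }
}
```

With the question's class in scope, a call would look like `ReversedNameSearch.OrderInsensitiveDistance("Turmon Uma", "Uma Turman", LevenshteinDistance.Compute)`. Note that this doubles the number of distance computations per employee.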
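Answer 1 (score: 1)
A few thoughts, as this is a problem that can be complicated to get right:
For a query such as John Smith, find the best single-word name matches for John, then match the remaining names of those "best match" employees against Smith and sum the distances. Then find the best matches for Smith, match the remaining names against John, and sum the distances again. The best match is the one with the lowest total distance. You can offer a list of best matches by returning, say, the top 10 sorted by total distance. It doesn't matter which way round the names are in the database or in the search terms. In fact, they could be completely out of order and it still wouldn't matter.
Be careful with accented letters such as á. Your algorithm won't work correctly with them. Be even more careful if you expect non-alphabetic double-byte characters, e.g. Chinese, Japanese, Arabic, etc.
Splitting each employee's name has two more benefits:
A hyphenated name can be stored in hyphenated form (Wells-Harvey), as a compound (WellsHarvey), and as individual names (Wells and Harvey stored separately), all against the same employee. A low-distance match on any one of the names is a low-distance match on the employee, and the extra names don't count against the total.
Here is some basic code that seems to work, although it only really takes points 1, 2 and 4 into account: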
    using System;
    using System.Collections.Generic;
    using System.Linq;

    namespace EmployeeSearch
    {
        static class Program
        {
            static List<string> EmployeesList = new List<string>() { "Jim Carrey", "Uma Thurman", "Bill Gates", "Jon Skeet" };
            static Dictionary<int, List<string>> employeesById = new Dictionary<int, List<string>>();
            static Dictionary<string, List<int>> employeeIdsByName = new Dictionary<string, List<int>>();

            static void Main()
            {
                Init();

                var results = FindEmployeeByNameFuzzy("Umaa Thurrmin");
                // Returns:
                // (1) Uma Thurman Distance: 3
                // (0) Jim Carrey Distance: 10
                // (3) Jon Skeet Distance: 11
                // (2) Bill Gates Distance: 12
                Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));

                // Reuse the variable; declaring "var results" a second time in the same scope does not compile
                results = FindEmployeeByNameFuzzy("Tormin Oma");
                // Returns:
                // (1) Uma Thurman Distance: 4
                // (3) Jon Skeet Distance: 7
                // (0) Jim Carrey Distance: 8
                // (2) Bill Gates Distance: 9
                Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));

                Console.Read();
            }

            private static void Init() // prepare our lists
            {
                for (int i = 0; i < EmployeesList.Count; i++)
                {
                    // Preparing the list of names for each employee - add special cases such as hyphenation here as well
                    var names = EmployeesList[i].ToLower().Split(new char[] { ' ' }).ToList();
                    employeesById.Add(i, names);

                    // This is not used here, but could come in handy if you want a unique index of names pointing to employee ids for optimisation:
                    foreach (var name in names)
                    {
                        if (employeeIdsByName.ContainsKey(name))
                        {
                            employeeIdsByName[name].Add(i);
                        }
                        else
                        {
                            employeeIdsByName.Add(name, new List<int>() { i });
                        }
                    }
                }
            }

            private static List<SearchResult> FindEmployeeByNameFuzzy(string query)
            {
                var results = new List<SearchResult>();

                // Notice we're splitting the search terms the same way as we split the employee names above (could be refactored out into a helper method)
                var searchterms = query.ToLower().Split(new char[] { ' ' });

                // Comparison with each employee
                for (int i = 0; i < employeesById.Count; i++)
                {
                    var r = new SearchResult() { Id = i, Name = EmployeesList[i] };
                    var employeenames = employeesById[i];
                    foreach (var searchterm in searchterms)
                    {
                        int min = searchterm.Length;

                        // for each search term get the min distance for all names for this employee
                        foreach (var name in employeenames)
                        {
                            var distance = LevenshteinDistance.Compute(searchterm, name);
                            min = Math.Min(min, distance);
                        }

                        // Sum the minimums for all search terms
                        r.Distance += min;
                    }
                    results.Add(r);
                }

                // Order by lowest distance first
                return results.OrderBy(e => e.Distance).ToList();
            }
        }

        public class SearchResult
        {
            public int Distance { get; set; }
            public int Id { get; set; }
            public string Name { get; set; }
        }

        public static class LevenshteinDistance
        {
            /// <summary>
            /// Compute the distance between two strings.
            /// </summary>
            public static int Compute(string s, string t)
            {
                int n = s.Length;
                int m = t.Length;
                int[,] d = new int[n + 1, m + 1];

                // Step 1
                if (n == 0)
                {
                    return m;
                }
                if (m == 0)
                {
                    return n;
                }

                // Step 2
                for (int i = 0; i <= n; d[i, 0] = i++)
                {
                }
                for (int j = 0; j <= m; d[0, j] = j++)
                {
                }

                // Step 3
                for (int i = 1; i <= n; i++)
                {
                    // Step 4
                    for (int j = 1; j <= m; j++)
                    {
                        // Step 5
                        int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                        // Step 6
                        d[i, j] = Math.Min(
                            Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                            d[i - 1, j - 1] + cost);
                    }
                }

                // Step 7
                return d[n, m];
            }
        }
    }
Just call Init() at startup, then call
    var results = FindEmployeeByNameFuzzy(userquery);
to get an ordered list of the best matches.
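The code above does not cover the earlier notes on accented letters and hyphenated names. A minimal sketch of how they might be handled when building each employee's name list follows. `NameKeys`, `Fold` and `Expand` are hypothetical helpers, not part of the code above, and the approach of stripping combining marks after Unicode decomposition is an assumption about what "handling accents" should mean for your data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class NameKeys
{
    // Lower-cases the text and removes combining marks, so "José" becomes "jose".
    // Letters without a canonical decomposition (such as "ø") pass through unchanged.
    public static string Fold(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }

    // Returns every searchable form of a full name. A hyphenated word is kept
    // as is and also added as a compound and as its separate parts.
    public static List<string> Expand(string fullName)
    {
        var keys = new List<string>();
        foreach (var word in Fold(fullName).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            keys.Add(word);
            if (word.Contains("-"))
            {
                keys.Add(word.Replace("-", ""));
                keys.AddRange(word.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return keys;
    }
}
```

In Init(), `NameKeys.Expand(EmployeesList[i])` could stand in for the `ToLower().Split(...)` call, and the query would be split through `Fold` in the same way. Because the search sums the minimum distance per search term, the extra forms should not count against an employee.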
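Disclaimer: this code is not optimal and has only been briefly tested. It doesn't check for nulls, could explode and kill kittens, and so on. If you have a large number of employees, it could be slow. Several improvements are possible. For example, while looping through the Levenshtein algorithm you could bail out as soon as the distance exceeds the current minimum distance.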
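The early exit mentioned in the disclaimer could look roughly like the sketch below. It relies on the fact that the smallest value in each row of the Levenshtein table never decreases from one row to the next, so once a whole row exceeds the limit the final distance must exceed it too. `BoundedLevenshtein` and its `limit` parameter are hypothetical and not part of the code above. The sketch assumes `limit` is not negative, and it keeps two rows rather than the full table.

```csharp
using System;

static class BoundedLevenshtein
{
    // Returns the Levenshtein distance between s and t, or limit + 1 as soon
    // as the distance is known to be greater than limit.
    public static int Compute(string s, string t, int limit)
    {
        // The distance is at least the difference in length.
        if (Math.Abs(s.Length - t.Length) > limit)
        {
            return limit + 1;
        }

        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            int rowMin = current[0];
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
                rowMin = Math.Min(rowMin, current[j]);
            }

            // Every cell in this row is already over the limit: stop early.
            if (rowMin > limit)
            {
                return limit + 1;
            }

            // Swap the two row buffers for the next iteration.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return Math.Min(previous[t.Length], limit + 1);
    }
}
```

Inside FindEmployeeByNameFuzzy, the inner loop could pass the running `min` as the limit, for example `BoundedLevenshtein.Compute(searchterm, name, min)`. Any result above `min` is then discarded by the existing `Math.Min` call.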