Question

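Our users sometimes give us misspelled names/usernames, and I would like to be able to search Active Directory for approximate matches, sorted by closeness (any algorithm will do). For example, if I try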

Get-Aduser -Filter {GivenName -like "Jack"}

I can find the user Jack, but not if I search for "Jacck" or "ack".

Is there a simple way to do this?

Answer 1

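You can compute the Levenshtein distance between two strings and check that it is below some threshold (perhaps 1 or 2). There is a PowerShell example here: Levenshtein distance in powershell

Examples:

Jack and Jacck have an LD of 1.
Jack and ack have an LD of 1.
Palle and Havnefoged have an LD of 8.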
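
As a rough illustration of the threshold idea, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function name Get-EditDistance and its parameters are made up for this example and are not taken from the linked script; it assumes a case-insensitive comparison is wanted by default.

```powershell
# Hypothetical helper (not from any module): two-row dynamic-programming
# Levenshtein distance, case-insensitive unless -CaseSensitive is given.
function Get-EditDistance {
    param(
        [string]$Source,
        [string]$Target,
        [switch]$CaseSensitive
    )
    if (-not $CaseSensitive) {
        $Source = $Source.ToLowerInvariant()
        $Target = $Target.ToLowerInvariant()
    }
    # $prev[$j] holds the distance between the processed prefix of $Source
    # and the first $j characters of $Target
    $prev = @(0..$Target.Length)
    for ($i = 1; $i -le $Source.Length; $i++) {
        $curr = New-Object 'int[]' ($Target.Length + 1)
        $curr[0] = $i
        for ($j = 1; $j -le $Target.Length; $j++) {
            $cost = if ($Source[$i - 1] -ceq $Target[$j - 1]) { 0 } else { 1 }
            $curr[$j] = [Math]::Min(
                [Math]::Min($prev[$j] + 1, $curr[$j - 1] + 1),  # deletion, insertion
                $prev[$j - 1] + $cost)                          # substitution or match
        }
        $prev = $curr
    }
    $prev[$Target.Length]
}

# Keep only the candidates within 2 edits of the (misspelled) query
$names = 'Jack', 'Jane', 'John', 'Zack'
$close = $names | Where-Object { (Get-EditDistance 'Jacck' $_) -le 2 }
$close   # Jack, Zack
```

A threshold of 2 already lets "Zack" through for the query "Jacck", so for short names a threshold of 1, or sorting by distance and showing the top few, may be the safer choice.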

Answer 2

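Interesting question and answers. A possibly simpler solution, though, is to search several attributes, since I would expect most people to spell at least one of the names correctly :)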

Get-ADUser -Filter {GivenName -like "FirstName" -or SurName -Like "SecondName"}

Answer 3

The Soundex algorithm was designed for exactly this situation. Here is some PowerShell code that may be useful:

Get-Soundex.ps1
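
The linked script is not reproduced here. To show what the technique does, the following is a minimal sketch of American Soundex; the function name Get-SoundexCode is hypothetical, and the sketch assumes plain ASCII names (any other character is dropped before encoding).

```powershell
# Hypothetical helper: American Soundex (first letter + three digits).
function Get-SoundexCode {
    param([string]$Name)
    $codes = @{
        B = '1'; F = '1'; P = '1'; V = '1'
        C = '2'; G = '2'; J = '2'; K = '2'; Q = '2'; S = '2'; X = '2'; Z = '2'
        D = '3'; T = '3'
        L = '4'
        M = '5'; N = '5'
        R = '6'
    }
    # Assumption: only A-Z is meaningful; everything else is stripped
    $letters = $Name.ToUpperInvariant() -replace '[^A-Z]', ''
    if ($letters.Length -eq 0) { return '' }

    $result = $letters.Substring(0, 1)
    $last = $codes[$result]            # code of the previous letter ($null for vowels)
    foreach ($ch in $letters.Substring(1).ToCharArray()) {
        $key = [string]$ch
        # H and W are skipped and do not separate letters with the same code
        if ($key -eq 'H' -or $key -eq 'W') { continue }
        $code = $codes[$key]
        if ($code -and $code -ne $last) { $result += $code }
        $last = $code                  # a vowel resets $last, so a repeat is coded again
        if ($result.Length -eq 4) { break }
    }
    $result.PadRight(4, '0')
}

Get-SoundexCode 'Jack'    # J200
Get-SoundexCode 'Jacck'   # J200
Get-SoundexCode 'ack'     # A200
```

Note that Soundex always keeps the first letter, so "Jacck" lands on the same code as "Jack" but "ack" does not. A dropped first letter is better handled by an edit-distance approach.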

Answer 4

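OK, based on the good answers I received (thanks @boxdog and @Palle Due), I am posting a more complete answer.

Main source: https://github.com/gravejester/Communary.PASM - PowerShell Approximate String Matching. A great module for this topic.

1) The FuzzyMatchScore function

Source: https://github.com/gravejester/Communary.PASM/tree/master/Functions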

# download functions to the temp folder
$urls = 
"https://raw.githubusercontent.com/gravejester/Communary.PASM/master/Functions/Get-CommonPrefix.ps1"    ,
"https://raw.githubusercontent.com/gravejester/Communary.PASM/master/Functions/Get-LevenshteinDistance.ps1" ,
"https://raw.githubusercontent.com/gravejester/Communary.PASM/master/Functions/Get-LongestCommonSubstring.ps1"  ,
"https://raw.githubusercontent.com/gravejester/Communary.PASM/master/Functions/Get-FuzzyMatchScore.ps1" 

$paths = $urls | %{$_.split("\/")|select -last 1| %{"$env:TEMP\$_"}}

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
for($i=0;$i -lt $urls.count;$i++){
Invoke-WebRequest -Uri $urls[$i] -OutFile $paths[$i]
}

# concatenate the functions into one file so we don't have to deal with source permissions
# (Set-Content overwrites the file, so re-running this does not append duplicate definitions)
Get-Content $paths | Set-Content "$env:TEMP\Fuzzy_score_functions.ps1"

# to save for later, open the temp folder with: Invoke-Item $env:TEMP 
# then copy "Fuzzy_score_functions.ps1" somewhere else

# source Fuzzy_score_functions.ps1
. "$env:TEMP\Fuzzy_score_functions.ps1"

Simple test:

Get-FuzzyMatchScore "a" "abc" # 98

Create the scoring function:

## start function
function get_score{
param($searchQuery,$searchData,$nlist,[switch]$levd)

if($nlist -eq $null){$nlist = 10}

$scores = foreach($string in $searchData){
    Try{
    if($levd){    
        $score = Get-LevenshteinDistance $searchQuery $string }
    else{
        $score = Get-FuzzyMatchScore -Search $searchQuery -String $string }
    Write-Output (,([PSCustomObject][Ordered] @{
                        Score = $score
                        Result = $string
                    }))
    $I = $searchData.indexof($string)/$searchData.count*100
    $I = [math]::Round($I)
    Write-Progress -Activity "Search in Progress" -Status "$I% Complete:" -PercentComplete $I
    }Catch{Continue}
}

if($levd) { $scores | Sort-Object Score,Result |select -First $nlist }
else {$scores | Sort-Object Score,Result -Descending |select -First $nlist }
} ## end function

Examples

get_score "Karolin" @("Kathrin","Jane","John","Cameron")

# check the difference between Fuzzy and LevenshteinDistance mode
$names = "Ferris","Cameron","Sloane","Jeanie","Edward","Tom","Katie","Grace"
"Fuzzy"; get_score "Cam" $names
"Levenshtein"; get_score "Cam" $names -levd

Testing performance on a large dataset

## download baby-names

$url = "https://github.com/hadley/data-baby-names/raw/master/baby-names.csv"
$output = "$env:TEMP\baby-names.csv"
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest -Uri $url -OutFile $output
$babynames = import-csv "$env:TEMP\baby-names.csv"
$babynames.count # 258000 lines

$babynames[0..3] # year, name, percent, sex

$searchdata = $babynames.name[0..499]

$query = "Waren" # missing letter
"Fuzzy"; get_score $query $searchdata
"Levenshtein"; get_score $query $searchdata -levd

$query = "Jon" # missing letter
"Fuzzy"; get_score $query $searchdata
"Levenshtein"; get_score $query $searchdata -levd

$query = "Howie" # lookalike
"Fuzzy"; get_score $query $searchdata;
"Levenshtein"; get_score $query $searchdata -levd

Test

$query = "John"

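$res = for($i=1;$i -le 10;$i++){
    $searchdata = $babynames.name[0..($i*100-1)]
    $meas = measure-command{$null = get_score $query $searchdata}
    write-host $i
    # TotalMilliseconds is the full elapsed time; the Milliseconds property is
    # only the 0-999 component of the TimeSpan and wraps around every second
    Write-Output (,([PSCustomObject][Ordered] @{
        N = $i*100
        MS = [math]::Round($meas.TotalMilliseconds)
        MS_per_line = [math]::Round($meas.TotalMilliseconds/$searchdata.Count,2)
                    }))
}
$res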

+------+-----+-------------+
| N    | MS  | MS_per_line |
| -    | --  | ----------- |
| 100  | 696 | 6.96        |
| 200  | 544 | 2.72        |
| 300  | 336 | 1.12        |
| 400  | 6   | 0.02        |
| 500  | 718 | 1.44        |
| 600  | 452 | 0.75        |
| 700  | 224 | 0.32        |
| 800  | 912 | 1.14        |
| 900  | 718 | 0.8         |
| 1000 | 417 | 0.42        |
+------+-----+-------------+

These timings are erratic; if anyone understands why, please comment.
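
One likely explanation, offered as a guess rather than a confirmed diagnosis: Measure-Command returns a TimeSpan, and the Milliseconds property of a TimeSpan is only the millisecond component (0 to 999), not the whole duration. If a table like the one above is recorded from that property, every run longer than one second wraps around. The duration below is made up purely to show the difference.

```powershell
# A made-up duration of 1.696 seconds, standing in for one slow run
$elapsed = [TimeSpan]::FromMilliseconds(1696)

$elapsed.Milliseconds        # 696   (component only, wraps at 1000)
$elapsed.Seconds             # 1
$elapsed.TotalMilliseconds   # 1696  (full duration)

# The same properties are available on what Measure-Command returns
$meas = Measure-Command { Start-Sleep -Milliseconds 50 }
$meas.TotalMilliseconds
```

Reading TotalMilliseconds should give timings that grow with the number of names searched.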

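2) Building a name table from Active Directory

The best way to do this depends on how your AD is organized. Here we have many OUs, but ordinary users are in Users and DisabledUsers. The domain and DC will also differ (I have replaced them with <domain> and <DC> here).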

# One way to get a List of OUs
Get-ADOrganizationalUnit -Filter * -Properties CanonicalName | 
  Select-Object -Property CanonicalName

You can then use Where-Object -FilterScript {} to filter on each OU

# example, saving on the temp folder
Get-ADUser -f * |
 Where-Object -FilterScript {
    ($_.DistinguishedName -match "CN=\w*,OU=DisabledUsers,DC=<domain>,DC=<DC>" -or
    $_.DistinguishedName -match "CN=\w*,OU=Users,DC=<domain>,DC=<DC>") -and
    $_.GivenName -ne $null #remove users without givenname, like test users
    } | 
    select @{n="Fullname";e={$_.GivenName+" "+$_.Surname}},
    GivenName,Surname,SamAccountName |
    Export-CSV -Path "$env:TEMP\all_Users.csv" -NoTypeInformation
# you can open the file to inspect 
Invoke-Item "$env:TEMP\all_Users.csv"
# import
$allusers = Import-Csv "$env:TEMP\all_Users.csv"
$allusers.Count # number of lines
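
One caveat about the DistinguishedName patterns above: \w* matches only letters, digits and underscores, so an account whose CN contains a space, a hyphen or an escaped comma is silently skipped. The sketch below compares that pattern with a more permissive one; the distinguished names and the example/com domain components are invented for illustration.

```powershell
# Made-up distinguished names for illustration
$dns = @(
    'CN=jdoe,OU=Users,DC=example,DC=com'
    'CN=Jane Doe,OU=Users,DC=example,DC=com'
    'CN=Doe\, Jane,OU=Users,DC=example,DC=com'
    'CN=svc,OU=Service,OU=Users,DC=example,DC=com'
)

# Pattern in the style used above: CN may contain word characters only
$strict  = '^CN=\w*,OU=Users,DC=example,DC=com$'
# Permissive variant: any character except an unescaped comma, or a backslash escape
$relaxed = '^CN=(?:[^,\\]|\\.)+,OU=Users,DC=example,DC=com$'

$strictHits  = @($dns | Where-Object { $_ -match $strict })
$relaxedHits = @($dns | Where-Object { $_ -match $relaxed })

$strictHits    # only CN=jdoe,...
$relaxedHits   # the first three; the nested Service OU is still excluded
```

Both patterns match direct children of the OU only, which keeps the behaviour of the original filter.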

Usage:

get_score "Jane Done" $allusers.fullname 15 # return the 15 first
get_score "jdoe" $allusers.samaccountname 15

Partial/approximate matching of names and/or usernames in Active Directory / PowerShell
