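I am trying to mask SSNs in a large text file by replacing them with random SSNs. The file is 400 MB (0.4 GB).
I want to find and replace 17,000 SSN instances.
Here is an example of the PowerShell script I am using.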
(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt
My problem is that I have 17,000 lines of this code in a .ps1 file. The .ps1 file looks similar to
(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt
(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "122-45-6789", "666-66-6668"} | set-content C:\TrainingFile\TrainingFile.txt
(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "223-45-6789", "666-66-6667"} | set-content C:\TrainingFile\TrainingFile.txt
(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-44-6789", "666-66-6669"} | set-content C:\TrainingFile\TrainingFile.txt
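and so on, for 17,000 PowerShell commands in the .ps1 file, one command per line.
I tested a single command, and it took about 15 seconds to execute. Doing the math, 17,000 × 15 seconds is about 255,000 seconds, or roughly 3 days, to run the .ps1 script of 17,000 commands.
Is there a faster way?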
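Answer 0 (score: 2)
The reason for the poor performance is that a lot of extra work is being done. Let's look at the process as a pseudo-algorithm like this,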
select SSN (X) and masked SSN (X') from a list
read all rows from file
look each file row for string X
if found, replace with X'
save all rows to file
loop until all SSNs are processed
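So what's the problem? For every SSN replacement, you process all the rows: not only the rows that need masking, but also the ones that don't. That's a lot of extra work. If you have 100 rows and 10 replacements, you use 1000 steps when only 100 are needed. Also, reading and saving the file creates disk IO. That isn't usually an issue for a single operation, but multiply the IO cost by the loop count (here, 17,000 full reads and writes of the file) and you'll find that quite a lot of time is wasted waiting for the disk.
To get great performance, tweak the algorithm like so,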
read all rows from file
loop through rows
for current row, change X -> X'
save the result
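Why is this faster? 1) You read and save the file only once. Disk IO is slow. 2) You process each row only once, so no extra work is done. As for how to actually perform the X -> X' transformation, you'll have to define more carefully what the masking rules are.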
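The single-pass idea above can be sketched as follows. This is a minimal illustration, not the definitive implementation: the `$ssnMask` table, the sample rows, and the `Convert-SsnLine` helper are all hypothetical names introduced here, and the sketch assumes that SSNs without an entry in the table should be left unchanged. It uses the .NET `Regex.Replace` overload that takes a `MatchEvaluator`, so every SSN on a row is looked up in one pass over that row.

```powershell
# Hypothetical lookup table: original SSN -> masked SSN (sample values from the question).
$ssnMask = @{
    '123-45-6789' = '666-66-6666'
    '223-45-6789' = '666-66-6667'
}

$ssnPattern = [regex]'\d{3}-\d{2}-\d{4}'

# Called once per regex match; returns the replacement text for that match.
[System.Text.RegularExpressions.MatchEvaluator]$evaluator = {
    param($match)
    if ($ssnMask.ContainsKey($match.Value)) {
        $ssnMask[$match.Value]
    }
    else {
        $match.Value   # assumption: unknown SSNs are left unchanged
    }
}

# Hypothetical helper: masks every SSN found in a single row.
function Convert-SsnLine {
    param([string]$Line)
    $ssnPattern.Replace($Line, $evaluator)
}

# Sample rows standing in for the file contents.
$rows = @(
    'Alice 123-45-6789 and Bob 223-45-6789',
    'no SSN here'
)

$masked = $rows | ForEach-Object { Convert-SsnLine $_ }
$masked
```

For a real file, the sample rows would be replaced by `Get-Content` on the input and the result piped to `Set-Content` on a separate output file, so the file is still read once and written once.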
Edit
Here's a more practical solution:
Since you already know the f(X) -> X' results, you should save the precalculated list to disk, like so,
ssn, mask
"123-45-6789", "666-66-6666"
...
"223-45-6789", "666-66-6667"
Import the file into a hashtable and proceed by stealing all the juicy bits from Ansgar's answer,
$ssnMask = @{}
$ssn = import-csv "c:\temp\SSNMasks.csv" -delimiter ","
# Add X -> X' to hashtable
$ssn | % {
    if(-not $ssnMask.ContainsKey($_.ssn)) {
        # It's an error to add existing key, so check first
        $ssnMask.Add($_.ssn, $_.mask)
    }
}
$dataToMask = get-content "c:\temp\training.txt"
$dataToMask | % {
    if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
        # Replace SSN look-a-like with value from hashtable
        # NB: This simply removes SSNs that don't have a match in hashtable
        $_ -replace $matches[1], $ssnMask[$matches[1]]
    } else {
        # No SSN in this row: pass it through unchanged so it is not dropped from the output
        $_
    }
} | set-content "c:\temp\training2.txt"
Answer 1 (score: 0)
Avoid reading and writing the file multiple times. I/O is expensive, and it will slow your script down. Try something like this:
$filename = 'C:\TrainingFile\TrainingFile.txt'
$ssnMap = @{}
(Get-Content $filename) | % {
    if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
        # If SSN is found, check if a mapping of that SSN to a random SSN exists.
        # Otherwise create a new mapping.
        if ( -not $ssnMap.ContainsKey($matches[1]) ) {
            do {
                $rnd = Get-Random -Min 100000 -Max 999999
                $newSSN = "666-$($rnd -replace '(..)(....)','$1-$2')"
            } while ( $ssnMap.ContainsValue($newSSN) ) # loop to avoid collisions
            $ssnMap[$matches[1]] = $newSSN
        }
        # Replace the SSN with the corresponding randomly generated SSN.
        $_ -replace $matches[1], $ssnMap[$matches[1]]
    } else {
        # If no SSN is found, simply print the line.
        $_
    }
} | Set-Content $filename
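If you already have a list of random SSNs, and they are also mapped to specific "real" SSNs, you can read those mappings from a CSV (example column titles: realSSN, randomSSN) into the $ssnMap hashtable: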
$ssnMap = @{}
Import-Csv 'C:\mappings.csv' | % { $ssnMap[$_.realSSN] = $_.randomSSN }
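To make the expected CSV shape concrete, here is a small sketch that builds the same hashtable from inline text instead of a file. The CSV content is made up for illustration, and `ConvertFrom-Csv` stands in for `Import-Csv` only so the example runs without `C:\mappings.csv` on disk.

```powershell
# Inline CSV text standing in for C:\mappings.csv (values are made up).
$csvText = @'
realSSN,randomSSN
123-45-6789,666-66-6666
223-45-6789,666-66-6667
'@

$ssnMap = @{}

# Split into lines, parse as CSV, and store each real SSN -> random SSN pair.
($csvText -split '\r?\n') |
    ConvertFrom-Csv |
    ForEach-Object { $ssnMap[$_.realSSN] = $_.randomSSN }

# Look up the replacement for one real SSN.
$ssnMap['123-45-6789']
```

With the table preloaded like this, the `ContainsKey` check in the script above finds every known SSN, so the random-generation branch only runs for SSNs that are missing from the CSV.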
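Answer 2 (score: 0)
If you have already generated a list of random SSNs for replacement, and each SSN in the file just needs to be replaced with one of them (not necessarily mapped to a specific replacement string), I think this will be much faster: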
$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'
$replacements = Get-Content 'C:\TrainingFile\SSN_Replacements.txt'
$i=0
Filter Replace-SSN { $_ -replace '\d{3}-\d{2}-\d{4}',$replacements[$i++] }
Get-Content $inputfile |
Replace-SSN |
Set-Content $outputfile
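This walks through your list of replacement SSNs, picking the next SSN in the list for each new replacement.
Edit:
Here is a solution that maps specific SSNs to specific replacement strings. It assumes you have a CSV file of the original SSNs and their intended replacement strings, in columns 'OldSSN' and 'NewSSN':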
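Note that `-replace` evaluates its replacement expression once per line, so two SSNs on the same line receive the same replacement. If a distinct value per SSN is wanted, a variation along the following lines might work. It is only a sketch: the replacement list and sample lines are invented, and wrapping around to the start of the list when it runs out is an assumption made here, not something the original script does.

```powershell
# Hypothetical replacement list standing in for SSN_Replacements.txt.
$replacements = @('666-66-0001', '666-66-0002', '666-66-0003')

# Mutable counter kept in a hashtable so the script block can update it.
$state = @{ Next = 0 }

$ssnPattern = [regex]'\d{3}-\d{2}-\d{4}'

# Called once per matched SSN; hands out the next replacement in the list.
[System.Text.RegularExpressions.MatchEvaluator]$nextReplacement = {
    param($match)
    # Assumption: wrap around when the list is exhausted.
    $value = $replacements[$state.Next % $replacements.Count]
    $state.Next++
    $value
}

# Sample lines standing in for the input file.
$lines = @(
    'A 111-11-1111 B 222-22-2222',
    'C 333-33-3333',
    'D 444-44-4444'
)

$result = $lines | ForEach-Object { $ssnPattern.Replace($_, $nextReplacement) }
$result
```

Because the counter advances per match rather than per line, the two SSNs on the first sample line get different replacements.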
$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'
$replacementfile = 'C:\TrainingFile\SSN_Replacements.csv'
$SSNmatch = [regex]'\d{3}-\d{2}-\d{4}'
$replacements = @{}
Import-Csv $replacementfile |
ForEach-Object { $replacements[$_.OldSSN] = $_.NewSSN }
Get-Content $inputfile -ReadCount 1000 |
    ForEach-Object {
        foreach ($Line in $_) {
            if ( $Line -match $SSNmatch ) { # Found SSN in line
                if ( $replacements.ContainsKey($matches[0]) ) { # Found replacement string for this SSN
                    $Line -replace $SSNmatch, $replacements[$matches[0]] # Replace SSN and output line
                }
                else {
                    # Note: the line itself is not written to the output in this case
                    Write-Warning "Warning - no replacement string found for $($matches[0])"
                }
            }
            else { $Line } # No SSN in this line - output line as-is
        }
    } | Set-Content $outputfile
Answer 3 (score: -1)
# Fairly fast PowerShell code for masking up to 1000 SSN number per line in a large text file (with unlimited # of lines in the file) where the SSN matches the pattern of " ###-##-#### ", " ##-####### ", or " ######### ".
# This code can handle a 14 MB text file that has SSN numbers in nearly every row within about 4 minutes.
# $inputFilename = 'C:/InputFile.txt'
$inputFileName = "
1
0550 125665 338066
- 02 CR05635 07/06/16
0 SAMPLE CUSTOMER NAME
PO BOX 12345
ROSEVILLE CA 12345-9109
EMPLOYEE DEFERRALS
FREDDIE MAC RO 16 9385456 164-44-9120 XXX
SALLY MAE RO 95 9385356 07-4719130 XXX
FRED FLINTSTONE RO 95 1185456 061741130 XXX
WILMA FLINTSTONE RO 91 9235456 364-74-9130 123456789 123456389 987354321 XXX
PEBBLES RUBBLE RO 10 9235456 06-3749130 064-74-9150 034-74-9130 XXX
BARNEY RUBBLE RO 11 9235456 06-3449130 06-3749140 063-74-9130 XXX
BETTY RUBBLE RO 16 9235456 9-74-9140 123456789 123456789 987654321 XXX
PLEASE ENTER BELOW ANY ADDITIONAL PARTICIPANTS FOR WHOM YOU ARE
REMITTING. FOR GENERAL INFORMATION AND SERVICE CALL
"
$outputFilename = 'D:/OutFile.txt'
#(Get-Content $inputFilename ) | % {
($inputFilename ) | % {
$NewLine=$_
# Write-Host "0 new line value is ($NewLine)."
$ChangeFound='Y'
$WhileCounter=0
While (($ChangeFound -eq 'Y') -and ($WhileCounter -lt 1000))
{
$WhileCounter=$WhileCounter+1
$ChangeFound='N'
$matches = $NewLine | Select-String -pattern "[ ][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
If ($matches.length -gt 0)
{
$ChangeFound='Y'
$NewLine=''
for($i = 0; $i -lt 1; $i++){
for($k = 0; $k -lt 1; $k++){
# Write-Host "AmHere 1a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
$t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ###-##-" + $matches[$i].matches[$k].value.substring(8) )
$NewLine=$NewLine + $t
# Write-Host "AmHere 1b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
}
}
# Write-Host "1 new line value is ($NewLine)."
}
$matches = $NewLine | Select-String -pattern "[ ][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
If ($matches.length -gt 0)
{
$ChangeFound='Y'
$NewLine=''
for($i = 0; $i -lt 1; $i++){
for($k = 0; $k -lt 1; $k++){
# Write-Host "AmHere 2a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
$t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ##-###" + $matches[$i].matches[$k].value.substring(7) )
$NewLine=$NewLine + $t
# Write-Host "AmHere 2b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
}
}
# Write-Host "2 new line value is ($NewLine)."
}
$matches = $NewLine | Select-String -pattern "[ ][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
If ($matches.length -gt 0)
{
$ChangeFound='Y'
$NewLine=''
for($i = 0; $i -lt 1; $i++){
for($k = 0; $k -lt 1; $k++){
# Write-Host "AmHere 3a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
$t = $matches[$i] -replace $matches[$i].matches[$k].value, (" #####" + $matches[$i].matches[$k].value.substring(6) )
$NewLine=$NewLine + $t
# Write-Host "AmHere 3b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
}
}
#print the line
# Write-Host "3 new line value is ($NewLine)."
}
# Write-Host "4 new line value is ($NewLine)."
} # end of DoWhile
Write-Host "5 new line value is ($NewLine)."
$NewLine
# Replace the SSN with the corresponding randomly generated SSN.
# $_ -replace $matches[1], $ssnMap[$matches[1]]
} | Set-Content $outputFilename