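Is there a fast batch or PowerShell script or command that can work out, for the lines of a txt file, which segment is repeated on every line and which segment varies, and then delete the repeated segment? Example:
Input file1.txt: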
abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113
Output file1.txt:
11234452232131
6176413190830
6278647822786
676122249819113
Input file2.txt:
11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz
Output file2.txt:
11234452232131
6176413190830
6278647822786
676122249819113
My script:
@echo off & setlocal enabledelayedexpansion
:startline
set /p first=<#SHA1.txt
set status=notequal
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~0,1!"=="!second:~0,1!" (set status=equal) else (set status=notequal & goto break)
)
if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto startline
)
:break
:endline
set /p first=<#SHA1.txt
set status=notequal
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~-1!"=="!second:~-1!" (set status=equal) else (set status=notequal & goto break)
)
if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~0,-1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto endline
)
:break
exit
I think this script is slow when it is run on many files.
Answer 0 (score: 2)
How about this (see the explanatory :: comments):
@echo off
::This script assumes that the lines of the input file (provided as command line argument)
::do not contain any of the characters `^`, `!`, and `"`. The lines may be of different
::lengths, empty lines are ignored though.
::The script processes the input file in two phases:
::1. let us call this the analysis phase, which consists of the following steps:
:: * read the first line of the file, store the string and determine its length;
:: * read the second line, walk through all characters beginning from the left and from
:: the right side within the same loop, find the character indexes that point to the
:: first left-most and the last right-most character that do not equal the respective
:: ones in the string from the first line, and store the retrieved indexes;
:: * read the remaining lines, and for each one, extract the prefix and the suffix that
:: is indicated by the respective stored indexes and compare them with the respective
:: prefix and suffix from the first line; if both are equal, stop processing here
:: and continue with the next line; otherwise, walk through all characters beginning
:: before the previous left-most and after the previous right-most character indexes
:: towards the respective ends of the string, find the character indexes that again
:: point to the first left-most and the last right-most character that do not equal
:: the respective ones in the string from the first line, and update the previously
:: stored indexes accordingly;
::2. let us call this the execution phase, which reads the input file again, extracts the
:: portion of each line that is indicated by the two computed indexes and returns it;
::The output is displayed in the console; to write it to a file, use redirection (`>`).
setlocal EnableDelayedExpansion
set "MIN=" & set "MAX=" & set /A "ROW=0"
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
set /A "ROW+=1" & set "STR=%%L"
if !ROW! equ 1 (
call :LENGTH LEN "%%L"
set "SAV=%%L"
) else if !ROW! equ 2 (
set /A "IDX=LEN-1"
for /L %%I in (0,1,!IDX!) do (
if not defined MIN (
if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
)
if not defined MAX (
set /A "IDX=%%I+1"
for %%J in (!IDX!) do (
if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
)
)
)
if not defined MIN set /A "MIN=LEN, MAX=-LEN"
) else (
set "NXT=#"
if !MIN! gtr 0 for %%I in (!MIN!) do if not "!STR:~,%%I!"=="!SAV:~,%%I!" set "NXT="
if !MAX! lss 0 for %%J in (!MAX!) do if not "!STR:~%%J!"=="!SAV:~%%J!" set "NXT="
if not defined NXT (
if !MAX! lss -!MIN! (set /A "IDX=1-MAX") else (set /A "IDX=MIN-1")
for /L %%I in (!IDX!,-1,0) do (
if %%I lss !MIN! (
if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
)
if -%%I geq !MAX! (
set /A "IDX=%%I+1"
for %%J in (!IDX!) do (
if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
)
)
)
)
)
)
if defined MAX if !MAX! equ 0 set "MAX=8192"
for /F "tokens=1,2" %%I in ("%MIN% %MAX%") do (
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
set "STR=%%L"
echo(!STR:~%%I,%%J!
)
)
endlocal
exit /B
:LENGTH <rtn_length> <val_string>
::Function to determine the length of a string.
::PARAMETERS:
:: <rtn_length> variable to receive the resulting string length;
:: <val_string> string value to determine the length of;
set "STR=%~2"
setlocal EnableDelayedExpansion
set /A "LEN=1"
if defined STR (
for %%C in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
if not "!STR:~%%C!"=="" set /A "LEN+=%%C" & set "STR=!STR:~%%C!"
)
) else set /A "LEN=0"
endlocal & set "%~1=%LEN%"
exit /B
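The two phases described in the comments above can also be sketched in PowerShell. This is a minimal sketch, not a port of the batch script: Get-CommonAffixLength is a hypothetical helper name, the sample lines are the file1.txt lines from the question, and the comparison is assumed to be case-sensitive.

```powershell
# Sketch only: Get-CommonAffixLength is a hypothetical helper name.
function Get-CommonAffixLength {
    param([string[]] $Lines)

    $ref    = $Lines[0]
    $prefix = $ref.Length    # upper bounds, only ever shrunk below
    $suffix = $ref.Length

    foreach ($line in $Lines) {
        # Analysis, left side: count leading characters shared with the first line.
        $limit = [Math]::Min($prefix, $line.Length)
        $p = 0
        while ($p -lt $limit -and $line[$p] -ceq $ref[$p]) { $p++ }
        $prefix = $p

        # Analysis, right side: count trailing characters shared with the first line.
        $limit = [Math]::Min($suffix, $line.Length)
        $s = 0
        while ($s -lt $limit -and $line[$line.Length - 1 - $s] -ceq $ref[$ref.Length - 1 - $s]) { $s++ }
        $suffix = $s
    }
    [pscustomobject]@{ Prefix = $prefix; Suffix = $suffix }
}

# Sample data from the question (file1.txt).
$lines = @(
    'abcde11234452232131'
    'abcde6176413190830'
    'abcde6278647822786'
    'abcde676122249819113'
)

$affix = Get-CommonAffixLength $lines

# Execution: cut the shared prefix and suffix from every line.
$result = $lines | ForEach-Object {
    $keep = [Math]::Max(0, $_.Length - $affix.Prefix - $affix.Suffix)
    $_.Substring($affix.Prefix, $keep)
}
$result
```

To try it on a real file, $lines could be filled with Get-Content -LiteralPath 'file1.txt' instead of the inline array. Because the prefix and suffix lengths only ever shrink, each line is compared against the first line at most up to the current bounds.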
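Depending on the data, the script above could still be improved:
* if the line length is known in advance, the :LENGTH subroutine call could be dropped and a constant value used instead; if there is a known maximum length of the common prefix/suffix, the line length is not needed at all;
* I use for /L loops to walk through certain character ranges, and their bodies are skipped by some if conditions for lack of something like exit for or a while loop; I could use goto to leave them, but then I would need to put those loops into separate subroutines so as not to break the outer loops; in any case, for [/L] loops still finish iterating in the background even when broken by goto, although faster than when executing the body. So, together with the slow call and goto, I doubt I would gain any speed; depending on the data, pure goto loops might be more efficient, since they can be left without any remaining background processing, but of course they would also need to be placed in their own subroutines.
Answer 1 (score: 2)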
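Remove a common prefix and/or suffix of unknown length from a list of strings
This batch file uses a very simple (and possibly inefficient) approach.
Pass the file name as an argument; otherwise file1.txt is used.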
:: Q:\Test\2018\06\29\SO_51093137.cmd
@echo off & setlocal enabledelayedexpansion
Set "File=%~1"
If not defined File Set "File=file1.txt"
Echo Processing %File%
:: get number of lines
for /f %%i in ('Find /V /C "" ^<"%File%"') Do Set Lines=%%i
Echo #Lines is %Lines%
:: get 1st line
Set /P "Line1=" < "%File%"
Echo Line1 is %Line1%
:: Iterate Prefixlength until Prefix doesn't match all lines
For /L %%i in (1,1,30) Do (
For /F %%A in ('
Findstr /B /L "!Line1:~0,%%i!" "%File%" ^|Find /C /V "" '
) Do Set "EQ=%%A"
If %Lines% neq !EQ! (Set /A "PrefixLength=%%i -1" & Goto :Break1)
)
:Break1
Echo PrefixLength is %PrefixLength%
:: Iterate Suffixlength until Suffix doesn't match all lines
For /L %%i in (-1,-1,-30) Do (
For /F %%A in ('
Findstr /E /L "!Line1:~%%i!" "%File%" ^|Find /C /V "" '
) Do Set "EQ=%%A"
If %Lines% neq !EQ! (Set /A "SuffixLength=%%i +1" & Goto :Break2)
)
:Break2
Echo SuffixLength is %SuffixLength%
Echo ============
For /f "usebackqDelims=" %%A in ("%File%") Do (
Set "Line=%%A"
If %SuffixLength%==0 (
Echo=!Line:~%PrefixLength%!
) Else (
Echo=!Line:~%PrefixLength%,%SuffixLength%!
)
)
Sample output:
> SO_51093137.cmd file2.txt
Processing file2.txt
#Lines is 4
Line1 is 11234452232131xyz
PrefixLength is 0
SuffixLength is -3
============
11234452232131
6176413190830
6278647822786
676122249819113
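For comparison, the same grow-and-count idea (extend a candidate taken from the first line until it no longer matches every line) can be sketched in PowerShell without a fixed upper limit such as the 30 used in the loops above. The variable names are illustrative only, and the sample lines are the file2.txt lines from the question.

```powershell
# Sample data from the question (file2.txt).
$lines = @(
    '11234452232131xyz'
    '6176413190830xyz'
    '6278647822786xyz'
    '676122249819113xyz'
)
$first    = $lines[0]
$shortest = ($lines | Measure-Object -Property Length -Minimum).Minimum

# Grow the prefix candidate while every line still starts with it.
$prefixLength = 0
while ($prefixLength -lt $shortest) {
    $candidate = $first.Substring(0, $prefixLength + 1)
    $hits = @($lines | Where-Object { $_.StartsWith($candidate, [StringComparison]::Ordinal) }).Count
    if ($hits -ne $lines.Count) { break }
    $prefixLength++
}

# Grow the suffix candidate the same way, without overlapping the prefix.
$suffixLength = 0
while ($suffixLength -lt ($shortest - $prefixLength)) {
    $candidate = $first.Substring($first.Length - $suffixLength - 1)
    $hits = @($lines | Where-Object { $_.EndsWith($candidate, [StringComparison]::Ordinal) }).Count
    if ($hits -ne $lines.Count) { break }
    $suffixLength++
}

$trimmed = $lines | ForEach-Object {
    $_.Substring($prefixLength, $_.Length - $prefixLength - $suffixLength)
}
$trimmed
```

Bounding both loops by the shortest line keeps Substring within range, and the ordinal comparison treats the candidate as literal text, much like the /L switch of Findstr.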
Answer 2 (score: 1)
The following may be overcomplicating things, but it stretched me, so it was a great learning experience for me.
$file1 = @(
,'abcde11234452232131'
,'abcde6176413190830'
,'abcde6278647822786'
,'abcde676122249819113'
)
function Test-EqualChar
{
param (
[Scriptblock] $Expression,
[Object[]] $Sequence,
[int] $i
)
!(($Sequence[1..($Sequence.Length -1)] | % {(&$Expression $_ $i) -eq ($Sequence[0][$i])}) -contains $False)
}
$OneChar = {param($x, $i) $x[$i]}
$start = for($i=0;$i -lt ($file1 | % {$_.Length} | Measure -Minimum | Select -ExpandProperty Minimum);$i++) {
if (!(Test-EqualChar $OneChar $file1 $i)) {$i; break}
}
$file1 | % {$_.Substring($start, $_.Length-$start)}
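I will leave it as an exercise to work out how to reverse (or pad) the strings in order to remove equal characters from the end of the strings.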
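One possible way to approach the exercise mentioned above, as a sketch under assumptions: reverse every string, measure the common prefix of the reversed strings, and cut that many characters from the end of the originals. Get-Reversed and Get-CommonPrefixLength are hypothetical helpers that are not part of the code above, and $file2 holds the file2.txt lines from the question.

```powershell
# Hypothetical helper: return the characters of a string in reverse order.
function Get-Reversed {
    param([string] $Text)
    $chars = $Text.ToCharArray()
    [Array]::Reverse($chars)
    -join $chars
}

# Hypothetical helper: number of leading characters shared by all strings.
function Get-CommonPrefixLength {
    param([string[]] $Strings)
    $shortest = ($Strings | Measure-Object -Property Length -Minimum).Minimum
    for ($i = 0; $i -lt $shortest; $i++) {
        $c = $Strings[0][$i]
        foreach ($s in $Strings) {
            if ($s[$i] -cne $c) { return $i }
        }
    }
    return [int] $shortest
}

$file2 = @(
    '11234452232131xyz'
    '6176413190830xyz'
    '6278647822786xyz'
    '676122249819113xyz'
)

# A common suffix of the originals is a common prefix of the reversed strings.
$reversed     = $file2 | ForEach-Object { Get-Reversed $_ }
$suffixLength = Get-CommonPrefixLength $reversed

$stripped = $file2 | ForEach-Object { $_.Substring(0, $_.Length - $suffixLength) }
$stripped
```

Reversing avoids the need for padding, since the strings may have different lengths and only their right-hand ends have to line up.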
Answer 3 (score: 1)
This solution uses a different approach. IMHO, this is the fastest way to process the file.
@echo off
setlocal EnableDelayedExpansion
if "%~1" equ "" echo Usage: %0 filename & goto :EOF
if not exist "%~1" echo File not found: "%~1" & goto :EOF
rem Read first two lines and get their base 0 lengths
( set /P "line1=" & set /P "line2=" ) < %1
call :StrLen0Var len1=line1
call :StrLen0Var len2=line2
rem Extract the largest *duplicate segment* from first two lines
set "maxDupSegLen=0"
for /L %%i in (0,1,%len1%) do (
for /L %%j in (0,1,%len2%) do (
if "!line1:~%%i,1!" equ "!line2:~%%j,1!" (
rem New duplicate segment, get its length and keep the largest one
set /A "maxLen=len1-%%i+1, maxLen2=len2-%%j+1"
if !maxLen2! gtr !maxLen! set "maxLen=!maxLen2!"
for /L %%l in (1,1,!maxLen!) do (
if "!line1:~%%i,%%l!" equ "!line2:~%%j,%%l!" set "dupSegLen=%%l"
)
if !dupSegLen! geq !maxDupSegLen! (
set /A "maxDupSegLen=dupSegLen, maxDupSegPos=%%i"
)
)
)
)
set "dupSeg=!line1:~%maxDupSegPos%,%maxDupSegLen%!"
rem Process the file removing duplicate segments
for /F "delims=" %%a in (%1) do (
set "line=%%a"
echo !line:%dupSeg%=!
)
goto :EOF
rem Get the length base 0 of a variable
:StrLen0Var len= var
setlocal EnableDelayedExpansion
set "str=!%2!"
set "len=0"
for /L %%a in (12,-1,0) do (
set /A "newLen=len+(1<<%%a)"
for %%b in (!newLen!) do if "!str:~%%b,1!" neq "" set "len=%%b"
)
endlocal & set "%1=%len%"
input1.txt:
abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113
Output:
11234452232131
6176413190830
6278647822786
676122249819113
input2.txt:
11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz
Output:
11234452232131
6176413190830
6278647822786
676122249819113
"The lines have variable length, and more than one duplicate segment may appear."
input3.txt:
abcde11234452232131
6176abcde4131908abcde30
6278647abcde822786
676122249819113abcde
Output:
11234452232131
6176413190830
6278647822786
676122249819113
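The core idea of this answer, taking the longest segment shared by the first two lines and deleting every occurrence of it, can be sketched in PowerShell as follows. Get-LongestCommonSubstring is a hypothetical helper that uses a plain brute-force search. Like the batch version, the sketch assumes that the longest segment shared by the first two lines really is the duplicate segment; a long coincidental run of identical digits would defeat it.

```powershell
# Hypothetical helper: brute-force longest common substring of two strings.
function Get-LongestCommonSubstring {
    param([string] $A, [string] $B)
    $best = ''
    for ($i = 0; $i -lt $A.Length; $i++) {
        for ($j = 0; $j -lt $B.Length; $j++) {
            # Extend the match starting at A[i] and B[j] as far as it goes.
            $len = 0
            while ((($i + $len) -lt $A.Length) -and (($j + $len) -lt $B.Length) -and ($A[$i + $len] -ceq $B[$j + $len])) {
                $len++
            }
            if ($len -gt $best.Length) { $best = $A.Substring($i, $len) }
        }
    }
    $best
}

# Sample data from input3.txt above.
$lines = @(
    'abcde11234452232131'
    '6176abcde4131908abcde30'
    '6278647abcde822786'
    '676122249819113abcde'
)

$dupSeg = Get-LongestCommonSubstring -A ($lines[0]) -B ($lines[1])

# Remove every occurrence of the segment; skip the replace if nothing is shared.
$cleaned = $lines | ForEach-Object {
    if ($dupSeg) { $_.Replace($dupSeg, '') } else { $_ }
}
$cleaned
```

The guard around Replace is there because String.Replace rejects an empty search string, which would be the case if the first two lines had no character in common.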