Question

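Is there a fast Batch/PowerShell script or command that analyzes the lines of a txt file, tells the repeated segment apart from the variable segments, and deletes only the repeated one? Example: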

Input file1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

Output file1.txt:

11234452232131
6176413190830
6278647822786
676122249819113

Input file2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

Output file2.txt:

11234452232131
6176413190830
6278647822786
676122249819113

My script:

@echo off & setlocal enabledelayedexpansion

:startline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~0,1!"=="!second:~0,1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto startline
)

:break

:endline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~-1!"=="!second:~-1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~0,-1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto endline
)

:break

exit

I think this script runs slowly when applied to multiple files.

Answer 1

What about this (see the explanatory :: comments):

@echo off
::This script assumes that the lines of the input file (provided as command line argument)
::do not contain any of the characters `^`, `!`, and `"`. The lines may be of different
::lengths, empty lines are ignored though.
::The script processes the input file in two phases:
::1. let us call this the analysis phase, which consists of the following steps:
::    * read the first line of the file, store the string and determine its length;
::    * read the second line, walk through all characters beginning from the left and from
::      the right side within the same loop, find the character indexes that point to the
::      first left-most and the last right-most character that do not equal the respective
::      ones in the string from the first line, and store the retreived indexes;
::    * read the remaining lines, and for each one, extract the prefix and the suffix that
::      is indicated by the respective stored indexes and compare them with the respective
::      prefix and suffix from the first line; if both are equal, exit the loop here
::      and continue with the next line; otherwise, walk through all characters beginning
::      before the previous left-most and after the previous right-most character indexes
::      towards the respective ends of the string, find the character indexes that again
::      point to the first left-most and the last right-most character that do not equal
::      the respective ones in the string from the first line, and update the previously
::      stored indexes accordingly;
::2. let us call this the execution phase, which reads the input file again, extracts the
::   portion of each line that is indicated by the two computed indexes and returns it;
::The output is displayed in the console; to write it to a file, use redirection (`>`).
setlocal EnableDelayedExpansion

set "MIN=" & set "MAX=" & set /A "ROW=0"
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
    set /A "ROW+=1" & set "STR=%%L"
    if !ROW! equ 1 (
        call :LENGTH LEN "%%L"
        set "SAV=%%L"
    ) else if !ROW! equ 2 (
        set /A "IDX=LEN-1"
        for /L %%I in (0,1,!IDX!) do (
            if not defined MIN (
                if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
            )
            if not defined MAX (
                set /A "IDX=%%I+1"
                for %%J in (!IDX!) do (
                    if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                )
            )
        )
        if not defined MIN set /A "MIN=LEN, MAX=-LEN"
    ) else (
        set "NXT=#"
        if !MIN! gtr 0 for %%I in (!MIN!) do if not "!STR:~,%%I!"=="!SAV:~,%%I!" set "NXT="
        if !MAX! lss 0 for %%J in (!MAX!) do if not "!STR:~%%J!"=="!SAV:~%%J!" set "NXT="
        if not defined NXT (
            if !MAX! lss -!MIN! (set /A "IDX=1-MAX") else (set /A "IDX=MIN-1")
            for /L %%I in (!IDX!,-1,0) do (
                if %%I lss !MIN! (
                    if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
                )
                if -%%I geq !MAX! (
                    set /A "IDX=%%I+1"
                    for %%J in (!IDX!) do (
                        if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                    )
                )
            )
        )
    )
)
if defined MAX if !MAX! equ 0 set "MAX=8192"
for /F "tokens=1,2" %%I in ("%MIN% %MAX%") do (
    for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
        set "STR=%%L"
        echo(!STR:~%%I,%%J!
    )
)

endlocal
exit /B


:LENGTH  <rtn_length>  <val_string>
    ::Function to determine the length of a string.
    ::PARAMETERS:
    ::  <rtn_length>  variable to receive the resulting string length;
    ::  <val_string>  string value to determine the length of;
    set "STR=%~2"
    setlocal EnableDelayedExpansion
    set /A "LEN=1"
    if defined STR (
        for %%C in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
            if not "!STR:~%%C!"=="" set /A "LEN+=%%C" & set "STR=!STR:~%%C!"
        )
    ) else set /A "LEN=0"
    endlocal & set "%~1=%LEN%"
    exit /B

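Depending on the data, further improvements are possible:

If the length of the first line is fixed, or the line lengths vary only within a small range, you can avoid calling the :LENGTH subroutine and use a constant value instead; if a maximum length of the common prefix/suffix is known, you do not need the line length at all.
Instead of reading the file twice (because of the two-pass algorithm), you could read it into memory once at the beginning and work on that data afterwards. For huge files this might not be a good idea, though.
I used several for /L loops to walk through certain character ranges. Because there is no exit for statement and nothing like a while loop, their bodies are skipped by if conditions once the result is known. I could have left the loops with goto, but then I would have to move them into separate subroutines so that the outer loops are not broken as well. In any case, for [/L] loops finish iterating in the background even after being broken by goto, although faster than when the body is executed. Together with the slowness of call and goto, I doubt I would gain any speed. Depending on the data, pure goto loops might be more efficient, because they can be left without any remaining background processing, but they too would have to be placed in subroutines of their own.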
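To make the two-phase idea easier to follow, here is a minimal sketch in Python rather than batch. It is not a translation of the script above: the function names `common_affix_lengths` and `strip_common_affixes` are made up for illustration, and capping the suffix so that it cannot overlap the prefix on the shortest line is my own assumption about the desired behavior.

```python
def common_affix_lengths(lines):
    """Analysis phase: return (prefix_len, suffix_len) shared by all non-empty lines."""
    first = None
    prefix = suffix = shortest = 0
    for line in lines:
        if not line:
            continue  # empty lines are ignored, as in the batch script
        if first is None:
            # The first line is the reference; start with the widest possible range.
            first = line
            prefix = suffix = shortest = len(line)
            continue
        shortest = min(shortest, len(line))
        # Narrow the prefix: it can only shrink as more lines are seen.
        limit, prefix = min(prefix, len(line)), 0
        while prefix < limit and line[prefix] == first[prefix]:
            prefix += 1
        # Narrow the suffix the same way, walking from the right end.
        limit, suffix = min(suffix, len(line)), 0
        while suffix < limit and line[-1 - suffix] == first[-1 - suffix]:
            suffix += 1
    # Assumption: prefix and suffix must not overlap on the shortest line.
    suffix = min(suffix, shortest - prefix)
    return prefix, suffix


def strip_common_affixes(lines):
    """Execution phase: cut the computed prefix and suffix from every line."""
    lines = [ln for ln in lines if ln]
    prefix, suffix = common_affix_lengths(lines)
    return [ln[prefix:len(ln) - suffix] for ln in lines]
```

Calling `strip_common_affixes` on the four lines of file1.txt or file2.txt from the question should yield the bare digit strings. Here the lines are held in memory, which corresponds to the second improvement listed above and is probably unsuitable for very large files.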

Answer 2

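Remove a common prefix and/or suffix of unknown length from a list of strings

This batch takes a very simple (and probably inefficient) approach:

It reads the first line and iterates over prefixes of increasing length, up to the first 30 characters.
It uses findstr to match the lines and pipes the result to find to count them.
If the count does not match the total number of lines in the file, the prefix has become too long,
and the batch moves on to the next step.
The same procedure is then applied to the suffix.
Finally the lines are truncated (even if a prefix and a suffix occur at the same time).

Pass the file name as an argument; otherwise file1.txt is used.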

:: Q:\Test\2018\06\29\SO_51093137.cmd
@echo off & setlocal enabledelayedexpansion
Set "File=%~1"
If not defined File Set "File=file1.txt"
Echo Processing %File%

:: get number of lines
for /f %%i in ('Find /V /C "" ^<"%File%"') Do Set Lines=%%i
Echo #Lines is %Lines%

:: get 1st line
Set /P "Line1=" < "%File%"
Echo Line1 is %Line1%

:: Iterate Prefixlength until Prefix doesn't match all lines
For /L %%i in (1,1,30) Do (
    For /F %%A in ('
        Findstr /B /L "!Line1:~0,%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "PrefixLength=%%i -1" & Goto :Break1)
)
:Break1
Echo PrefixLength is %PrefixLength%

:: Iterate Suffixlength until Suffix doesn't match all lines
For /L %%i in (-1,-1,-30) Do (
    For /F %%A in ('
        Findstr /E /L "!Line1:~%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "SuffixLength=%%i +1" & Goto :Break2)
)
:Break2

Echo SuffixLength is %SuffixLength%
Echo ============
For /f "usebackqDelims=" %%A in ("%File%") Do (
    Set "Line=%%A"
    If %SuffixLength%==0 (
        Echo=!Line:~%PrefixLength%!
    ) Else (
        Echo=!Line:~%PrefixLength%,%SuffixLength%!
    )
)

Sample output:

> SO_51093137.cmd file2.txt
Processing file2.txt
#Lines is 4
Line1 is 11234452232131xyz
PrefixLength is 0
SuffixLength is -3
============
11234452232131
6176413190830
6278647822786
676122249819113
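
The counting idea behind this answer can be sketched in Python as follows. This is only an illustration under some assumptions: `count_based_lengths` and `truncate` are hypothetical names, the limit of 30 mirrors the batch, and the suffix length is reported as a positive number instead of the negative offset the batch prints.

```python
def count_based_lengths(lines, limit=30):
    """Grow a prefix (then a suffix) of the first line until some line stops matching."""
    first = lines[0]
    top = min(limit, len(first))

    prefix = 0
    for n in range(1, top + 1):
        # Equivalent in spirit to: findstr /B /L <prefix> file | find /C /V ""
        if sum(ln.startswith(first[:n]) for ln in lines) != len(lines):
            break  # this prefix is one character too long
        prefix = n

    suffix = 0
    for n in range(1, top + 1):
        # Equivalent in spirit to: findstr /E /L <suffix> file | find /C /V ""
        if sum(ln.endswith(first[-n:]) for ln in lines) != len(lines):
            break
        suffix = n
    return prefix, suffix


def truncate(lines, limit=30):
    """Cut the detected prefix and suffix from every line."""
    prefix, suffix = count_based_lengths(lines, limit)
    return [ln[prefix:len(ln) - suffix] for ln in lines]
```

Each candidate length rescans every line, so the work grows with the number of lines times the length of the common part. That is the same trade-off the batch makes by calling findstr once per candidate.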

Answer 3

The following may overcomplicate things, but it had me hooked, so it was a great learning experience for me.

$file1 = @(
    ,'abcde11234452232131' 
    ,'abcde6176413190830'
    ,'abcde6278647822786'
    ,'abcde676122249819113'
)

function Test-EqualChar
{
    param (
        [Scriptblock] $Expression,
        [Object[]] $Sequence,
        [int] $i
    )
    !(($Sequence[1..($Sequence.Length -1)] | % {(&$Expression $_ $i) -eq ($Sequence[0][$i])}) -contains $False)
}

$OneChar = {param($x, $i) $x[$i]}
$start = for($i=0;$i -lt ($file1 | % {$_.Length} | Measure -Minimum | Select -ExpandProperty Minimum);$i++) {
    if (!(Test-EqualChar $OneChar $file1 $i)) {$i; break}
}
$file1 | % {$_.Substring($start, $_.Length-$start)}

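I'll leave it as an exercise to work out reversing (or padding) the strings in order to remove equal characters from the end of the strings.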
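
One possible take on that exercise, sketched in Python rather than PowerShell to show the logic: reverse every string, reuse the prefix search, and reverse the result back. The helper names `common_prefix`, `common_suffix`, and `strip_suffix` are assumptions for illustration, not part of the answer.

```python
def common_prefix(strings):
    """Longest prefix shared by all strings, compared character by character."""
    shortest = min(len(s) for s in strings)
    for i in range(shortest):
        # Same check as Test-EqualChar: does position i agree with the first string?
        if any(s[i] != strings[0][i] for s in strings[1:]):
            return strings[0][:i]
    return strings[0][:shortest]


def common_suffix(strings):
    """Reverse each string, find the common prefix, then reverse it back."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def strip_suffix(strings):
    """Remove the common suffix from every string."""
    cut = len(common_suffix(strings))
    return [s[:len(s) - cut] for s in strings]
```

With the file2.txt lines from the question, `common_suffix` should find `xyz`, and `strip_suffix` then leaves only the digits.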

Answer 4

This solution uses a different approach. IMHO this is the fastest way to process the files.

@echo off
setlocal EnableDelayedExpansion

if "%~1" equ "" echo Usage: %0 filename & goto :EOF
if not exist "%~1" echo File not found: "%~1" & goto :EOF

rem Read first two lines and get their base 0 lengths
( set /P "line1=" & set /P "line2=" ) < %1
call :StrLen0Var len1=line1
call :StrLen0Var len2=line2

rem Extract the largest *duplicate segment* from first two lines
set "maxDupSegLen=0"
for /L %%i in (0,1,%len1%) do (
   for /L %%j in (0,1,%len2%) do (
      if "!line1:~%%i,1!" equ "!line2:~%%j,1!" (
         rem New duplicate segment, get its length and keep the largest one
         set /A "maxLen=len1-%%i+1, maxLen2=len2-%%j+1"
         if !maxLen2! gtr !maxLen! set "maxLen=!maxLen2!"
         for /L %%l in (1,1,!maxLen!) do (
            if "!line1:~%%i,%%l!" equ "!line2:~%%j,%%l!" set "dupSegLen=%%l"
         )
         if !dupSegLen! geq !maxDupSegLen! (
            set /A "maxDupSegLen=dupSegLen, maxDupSegPos=%%i"
         )
      )
   )
)
set "dupSeg=!line1:~%maxDupSegPos%,%maxDupSegLen%!"

rem Process the file removing duplicate segments
for /F "delims=" %%a in (%1) do (
   set "line=%%a"
   echo !line:%dupSeg%=!
)

goto :EOF


rem Get the length base 0 of a variable

:StrLen0Var len= var
setlocal EnableDelayedExpansion
set "str=!%2!"
set "len=0"
for /L %%a in (12,-1,0) do (
   set /A "newLen=len+(1<<%%a)"
   for %%b in (!newLen!) do if "!str:~%%b,1!" neq "" set "len=%%b"
)
endlocal & set "%1=%len%"

input1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

Output:

11234452232131
6176413190830
6278647822786
676122249819113

input2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

Output:

11234452232131
6176413190830
6278647822786
676122249819113

"The lines have variable length, and several repeated segments may appear."

input3.txt:

abcde11234452232131
6176abcde4131908abcde30
6278647abcde822786
676122249819113abcde

Output:

11234452232131
6176413190830
6278647822786
676122249819113
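
The approach of this answer (take the longest segment shared by the first two lines and delete every occurrence of it) can be sketched in Python as shown below. It swaps the nested for /L loops for the standard library's `difflib.SequenceMatcher`; the function names are hypothetical. One caveat: when two shared segments have the same length, the tie-breaking order decides which one is chosen, so the result may differ from the batch script.

```python
from difflib import SequenceMatcher


def longest_common_segment(a, b):
    """Return the longest substring that occurs in both a and b."""
    # autojunk=False keeps frequent characters from being ignored in long lines.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def remove_segment(lines):
    """Delete every occurrence of the segment shared by the first two lines."""
    segment = longest_common_segment(lines[0], lines[1])
    if not segment:
        return list(lines)  # nothing in common, leave the lines untouched
    return [ln.replace(segment, "") for ln in lines]
```

For the input3.txt lines above, the shared segment is `abcde`, and removing it everywhere gives the listed output, including the line where it appears twice.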
