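Batch: remove only the repeated segments from strings

Date: 2018-06-29 01:31:22

Tags: powershell batch-file cmd

Is there a quick script or command in Batch/PowerShell that can analyze the lines of a txt file, work out which segment is repeated and which is variable, and remove only the repeated one? Example:

Input file1.txt: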

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

Output file1.txt:

11234452232131
6176413190830
6278647822786
676122249819113

Input file2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

Output file2.txt:

11234452232131
6176413190830
6278647822786
676122249819113

My script:

@echo off & setlocal enabledelayedexpansion

:startline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~0,1!"=="!second:~0,1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto startline
)

:break

:endline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~-1!"=="!second:~-1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~0,-1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto endline
)

:break

exit

I think this script is slow when run over multiple files.

4 answers:

Answer 0 (score: 2)

How about this (see the explanatory :: comments):

@echo off
::This script assumes that the lines of the input file (provided as command line argument)
::do not contain any of the characters `^`, `!`, and `"`. The lines may be of different
::lengths, empty lines are ignored though.
::The script processes the input file in two phases:
::1. let us call this the analysis phase, which consists of the following steps:
::    * read the first line of the file, store the string and determine its length;
::    * read the second line, walk through all characters beginning from the left and from
::      the right side within the same loop, find the character indexes that point to the
::      first left-most and the last right-most character that do not equal the respective
::      ones in the string from the first line, and store the retrieved indexes;
::    * read the remaining lines, and for each one, extract the prefix and the suffix that
::      is indicated by the respective stored indexes and compare them with the respective
::      prefix and suffix from the first line; if both are equal, exit the loop here
::      and continue with the next line; otherwise, walk through all characters beginning
::      before the previous left-most and after the previous right-most character indexes
::      towards the respective ends of the string, find the character indexes that again
::      point to the first left-most and the last right-most character that do not equal
::      the respective ones in the string from the first line, and update the previously
::      stored indexes accordingly;
::2. let us call this the execution phase, which reads the input file again, extracts the
::   portion of each line that is indicated by the two computed indexes and returns it;
::The output is displayed in the console; to write it to a file, use redirection (`>`).
setlocal EnableDelayedExpansion

set "MIN=" & set "MAX=" & set /A "ROW=0"
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
    set /A "ROW+=1" & set "STR=%%L"
    if !ROW! equ 1 (
        call :LENGTH LEN "%%L"
        set "SAV=%%L"
    ) else if !ROW! equ 2 (
        set /A "IDX=LEN-1"
        for /L %%I in (0,1,!IDX!) do (
            if not defined MIN (
                if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
            )
            if not defined MAX (
                set /A "IDX=%%I+1"
                for %%J in (!IDX!) do (
                    if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                )
            )
        )
        if not defined MIN set /A "MIN=LEN, MAX=-LEN"
    ) else (
        set "NXT=#"
        if !MIN! gtr 0 for %%I in (!MIN!) do if not "!STR:~,%%I!"=="!SAV:~,%%I!" set "NXT="
        if !MAX! lss 0 for %%J in (!MAX!) do if not "!STR:~%%J!"=="!SAV:~%%J!" set "NXT="
        if not defined NXT (
            if !MAX! lss -!MIN! (set /A "IDX=1-MAX") else (set /A "IDX=MIN-1")
            for /L %%I in (!IDX!,-1,0) do (
                if %%I lss !MIN! (
                    if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
                )
                if -%%I geq !MAX! (
                    set /A "IDX=%%I+1"
                    for %%J in (!IDX!) do (
                        if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                    )
                )
            )
        )
    )
)
if defined MAX if !MAX! equ 0 set "MAX=8192"
for /F "tokens=1,2" %%I in ("%MIN% %MAX%") do (
    for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
        set "STR=%%L"
        echo(!STR:~%%I,%%J!
    )
)

endlocal
exit /B


:LENGTH  <rtn_length>  <val_string>
    ::Function to determine the length of a string.
    ::PARAMETERS:
    ::  <rtn_length>  variable to receive the resulting string length;
    ::  <val_string>  string value to determine the length of;
    set "STR=%~2"
    setlocal EnableDelayedExpansion
    set /A "LEN=1"
    if defined STR (
        for %%C in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
            if not "!STR:~%%C!"=="" set /A "LEN+=%%C" & set "STR=!STR:~%%C!"
        )
    ) else set /A "LEN=0"
    endlocal & set "%~1=%LEN%"
    exit /B

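Depending on the data, there is still room for improvement:

  • If the length of the first line is fixed, or the line lengths vary only within a small range, the call to the :LENGTH subroutine can be avoided and a constant value used instead; if a maximum length of the common prefix/suffix is known, the line length is not needed at all;
  • Instead of reading the file twice (because of the two-pass algorithm), you could read it into memory once at the beginning and work on that data afterwards. For huge files this might not be a good idea;
  • I used several for /L loops to walk through certain character ranges. Because there is nothing like a while loop or an exit for, their bodies are skipped by some if conditions. I could have used goto to leave them, but then I would have had to put those loops into separate subroutines so as not to break the outer loops as well. In any case, for [/L] loops finish iterating in the background even when broken by goto, although faster than when executing the body. So, together with the slow call and goto, I would probably gain little or no speed. Depending on the data, pure goto loops might be more efficient, because they can be left without any remaining background processing, but of course they would also need to be placed in their own subroutines;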
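
The second point above (read the file into memory once, then work on that data) can be sketched in PowerShell, which the question also allows. This is only a minimal sketch, not the batch script above: the function name Remove-CommonAffix is hypothetical, the comparison is case-sensitive, and it assumes the whole file fits comfortably in memory.

```powershell
# Hypothetical helper, not part of the batch answer above: returns the input
# lines with their longest common prefix and longest common suffix removed.
function Remove-CommonAffix {
    param([string[]] $Lines)

    if ($Lines.Count -lt 2) { return $Lines }   # nothing to compare against

    $first  = $Lines[0]
    $prefix = $first.Length   # can only shrink while scanning the lines
    $suffix = $first.Length

    foreach ($line in $Lines) {
        # Count leading characters shared with the first line.
        $max = [Math]::Min($prefix, $line.Length)
        $p = 0
        while ($p -lt $max -and $line[$p] -ceq $first[$p]) { $p++ }
        $prefix = $p

        # Count trailing characters shared with the first line.
        $max = [Math]::Min($suffix, $line.Length)
        $s = 0
        while ($s -lt $max -and
               $line[$line.Length - 1 - $s] -ceq $first[$first.Length - 1 - $s]) { $s++ }
        $suffix = $s
    }

    foreach ($line in $Lines) {
        # Guard against prefix and suffix overlapping on very short lines.
        $keep = [Math]::Max(0, $line.Length - $prefix - $suffix)
        $line.Substring([Math]::Min($prefix, $line.Length), $keep)
    }
}
```

It could be called as Remove-CommonAffix (Get-Content file1.txt) | Set-Content file1.out.txt, where file1.out.txt is just an example output name.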

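Answer 1 (score: 2)

Remove a common prefix and/or suffix of unknown length from a list of strings

This batch file takes a very simple (and possibly inefficient) approach:

  • It reads the first line and iterates over prefixes of increasing length, up to the first 30 characters
  • It uses findstr to match lines and pipes the result to find to count the matching lines
  • If that count does not match the total number of lines in the file, the prefix has become too long and
    the batch continues with the next step.
  • The same procedure is then applied to the suffix
  • Finally the lines are truncated (prefix and suffix at the same time, if both are present)

Pass the file name as an argument; otherwise file1.txt is used.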

:: Q:\Test\2018\06\29\SO_51093137.cmd
@echo off & setlocal enabledelayedexpansion
Set "File=%~1"
If not defined File Set "File=file1.txt"
Echo Processing %File%

:: get number of lines
for /f %%i in ('Find /V /C "" ^<"%File%"') Do Set Lines=%%i
Echo #Lines is %Lines%

:: get 1st line
Set /P "Line1=" < "%File%"
Echo Line1 is %Line1%

:: Iterate Prefixlength until Prefix doesn't match all lines
For /L %%i in (1,1,30) Do (
    For /F %%A in ('
        Findstr /B /L "!Line1:~0,%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "PrefixLength=%%i -1" & Goto :Break1)
)
:Break1
Echo PrefixLength is %PrefixLength%

:: Iterate Suffixlength until Suffix doesn't match all lines
For /L %%i in (-1,-1,-30) Do (
    For /F %%A in ('
        Findstr /E /L "!Line1:~%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "SuffixLength=%%i +1" & Goto :Break2)
)
:Break2

Echo SuffixLength is %SuffixLength%
Echo ============
For /f "usebackqDelims=" %%A in ("%File%") Do (
    Set "Line=%%A"
    If %SuffixLength%==0 (
        Echo=!Line:~%PrefixLength%!
    ) Else (
        Echo=!Line:~%PrefixLength%,%SuffixLength%!
    )
)

Sample output:

> SO_51093137.cmd file2.txt
Processing file2.txt
#Lines is 4
Line1 is 11234452232131xyz
PrefixLength is 0
SuffixLength is -3
============
11234452232131
6176413190830
6278647822786
676122249819113

Answer 2 (score: 1)

The following may overcomplicate things, but the problem challenged me, so it was a great learning experience for me.

$file1 = @(
    ,'abcde11234452232131' 
    ,'abcde6176413190830'
    ,'abcde6278647822786'
    ,'abcde676122249819113'
)

function Test-EqualChar
{
    param (
        [Scriptblock] $Expression,
        [Object[]] $Sequence,
        [int] $i
    )
    !(($Sequence[1..($Sequence.Length -1)] | % {(&$Expression $_ $i) -eq ($Sequence[0][$i])}) -contains $False)
}

$OneChar = {param($x, $i) $x[$i]}
$start = for($i=0;$i -lt ($file1 | % {$_.Length} | Measure -Minimum | Select -ExpandProperty Minimum);$i++) {
    if (!(Test-EqualChar $OneChar $file1 $i)) {$i; break}
}
$file1 | % {$_.Substring($start, $_.Length-$start)}

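I'll leave it as an exercise to reverse (or pad) the strings in order to remove the equal characters from the end of the strings.

Answer 3 (score: 1)

This solution uses a different approach. IMHO it is the fastest way to process the file.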
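
One possible way to do that exercise is sketched below, using the reversing idea: once every line is reversed, the common suffix becomes a common prefix. This is an assumption about how it could be done, not the answerer's code; the helper name Get-ReversedString and the variable $file2 (filled with the file2.txt sample from the question) are made up for the illustration.

```powershell
# Sample data, taken from file2.txt in the question.
$file2 = @(
    '11234452232131xyz'
    '6176413190830xyz'
    '6278647822786xyz'
    '676122249819113xyz'
)

# Hypothetical helper: reverse the characters of a string.
function Get-ReversedString {
    param([string] $Text)
    $chars = $Text.ToCharArray()
    [Array]::Reverse($chars)
    -join $chars
}

# Reverse every line so the common suffix becomes a common prefix.
$reversed = @($file2 | ForEach-Object { Get-ReversedString $_ })

# The shortest line bounds how far the comparison may run.
$minLength = ($reversed | Measure-Object -Property Length -Minimum).Minimum

# Advance while no line differs from the first line at index $end.
$end = 0
while ($end -lt $minLength -and
       @($reversed | Where-Object { $_[$end] -cne $reversed[0][$end] }).Count -eq 0) {
    $end++
}

# Drop that many characters from the end of each original line.
$trimmed = $file2 | ForEach-Object { $_.Substring(0, $_.Length - $end) }
$trimmed
```

The reversed copies are only used to find the suffix length; the final Substring call works on the original lines, so nothing has to be reversed back.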

@echo off
setlocal EnableDelayedExpansion

if "%~1" equ "" echo Usage: %0 filename & goto :EOF
if not exist "%~1" echo File not found: "%~1" & goto :EOF

rem Read first two lines and get their base 0 lengths
( set /P "line1=" & set /P "line2=" ) < %1
call :StrLen0Var len1=line1
call :StrLen0Var len2=line2

rem Extract the largest *duplicate segment* from first two lines
set "maxDupSegLen=0"
for /L %%i in (0,1,%len1%) do (
   for /L %%j in (0,1,%len2%) do (
      if "!line1:~%%i,1!" equ "!line2:~%%j,1!" (
         rem New duplicate segment, get its length and keep the largest one
         set /A "maxLen=len1-%%i+1, maxLen2=len2-%%j+1"
         if !maxLen2! gtr !maxLen! set "maxLen=!maxLen2!"
         for /L %%l in (1,1,!maxLen!) do (
            if "!line1:~%%i,%%l!" equ "!line2:~%%j,%%l!" set "dupSegLen=%%l"
         )
         if !dupSegLen! geq !maxDupSegLen! (
            set /A "maxDupSegLen=dupSegLen, maxDupSegPos=%%i"
         )
      )
   )
)
set "dupSeg=!line1:~%maxDupSegPos%,%maxDupSegLen%!"

rem Process the file removing duplicate segments
for /F "delims=" %%a in (%1) do (
   set "line=%%a"
   echo !line:%dupSeg%=!
)

goto :EOF


rem Get the base-0 length (index of the last character) of a variable

:StrLen0Var len= var
setlocal EnableDelayedExpansion
set "str=!%2!"
set "len=0"
for /L %%a in (12,-1,0) do (
   set /A "newLen=len+(1<<%%a)"
   for %%b in (!newLen!) do if "!str:~%%b,1!" neq "" set "len=%%b"
)
endlocal & set "%1=%len%"

input1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

Output:

11234452232131
6176413190830
6278647822786
676122249819113

input2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

Output:

11234452232131
6176413190830
6278647822786
676122249819113

"The lines have variable length, and more than one duplicate segment may occur."

input3.txt:

abcde11234452232131
6176abcde4131908abcde30
6278647abcde822786
676122249819113abcde

Output:

11234452232131
6176413190830
6278647822786
676122249819113
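
For comparison, the same "find the duplicate segment, then delete it wherever it occurs" idea can be sketched in PowerShell. Note that this is a variant, not a port of the batch file above: it looks for the longest substring of the first line that occurs in every line, rather than comparing only the first two lines. The function name Get-CommonSegment and the variable $input3 are hypothetical, and the search is a plain brute-force loop, so no speed claim is made for it.

```powershell
# Hypothetical helper: longest substring of the first line found in every line.
function Get-CommonSegment {
    param([string[]] $Lines)

    $first = $Lines[0]
    # Try the longest candidates first so the first hit is the longest one.
    for ($len = $first.Length; $len -ge 1; $len--) {
        for ($pos = 0; $pos -le $first.Length - $len; $pos++) {
            $candidate = $first.Substring($pos, $len)
            $missing = @($Lines | Where-Object { -not $_.Contains($candidate) })
            if ($missing.Count -eq 0) { return $candidate }
        }
    }
    return ''   # the lines do not share a single character
}

# Sample data, taken from input3.txt above.
$input3 = @(
    'abcde11234452232131'
    '6176abcde4131908abcde30'
    '6278647abcde822786'
    '676122249819113abcde'
)

$segment = Get-CommonSegment $input3

# Remove every occurrence of the segment; skip the call if nothing was found,
# because String.Replace rejects an empty search string.
$result = $input3 | ForEach-Object {
    if ($segment) { $_.Replace($segment, '') } else { $_ }
}
$result
```

Like the batch version, this removes whatever the longest shared segment happens to be, so data in which the variable parts coincidentally share a longer run of characters than the intended segment would be trimmed incorrectly.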