Question

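I'm looking for a batch file (or AutoHotkey) solution to help me merge two huge .txt files that share a common word at the start of each line. The second part of each line in the second file should be appended to the end of the corresponding line in the first file, but only when the first word matches exactly in both files, and the result should be written to a third file.

I looked around for a while and found similar examples, but they all seem to merge lines pairwise, whether or not the first words match.

So I'm looking for this: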

File1.txt contains:

Anna=134=address=nonoanneo=othertext
Robert=682sd=eventpromotion=elseinfo=etc
Richard=Carnameother=dog=glasses
Paula=Home4562construction=car
Jessica=koala=6788655=anyothertext=anyothertext

File2.txt contains:

Anna=Yoga
Robert=Ping pong
Carlos=Gardening
Richard=Video games
Jessica=Movies

ExpectedResult.txt:

Anna=134=address=nonoanneo=othertext=Yoga
Robert=682sd=eventpromotion=elseinfo=etc=Ping pong
Richard=Carnameother=dog=glasses=Video games
Jessica=koala=6788655=anyothertext=anyothertext=Movies

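Note that neither Paula (File1.txt) nor Carlos (File2.txt) is included in the final result, because neither has a matching first word in the other file.

I have a working solution in AutoHotkey, but it uses nested loops and takes about half an hour to compare roughly 30,000 lines in each file, so it's not an option :/

I know the solution may be a piece of cake for some of you, so thanks in advance for any pointers.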

Answer 1

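The standard way to merge files is to sort both of them by the common key, so that a record missing from either file can be identified immediately by its position. In this case, however, neither file is sorted, so one complete file has to be loaded into an in-memory array.

The batch file below solves this problem, but the speed of a batch file is affected by the number of variables defined in memory, so I can't guarantee that it will be very fast. I can only hope it is faster than the AutoIT solution...

This method assumes that File1.txt contains only one record for any given key.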

@echo off
setlocal EnableDelayedExpansion

rem Load the second file in an array
for /F "tokens=1* delims==" %%a in (File2.txt) do set "w[%%a]=%%b"

rem Process the first file, and merge it with the array
(for /F "tokens=1* delims==" %%a in (File1.txt) do (
   if defined w[%%a] (
      echo %%a=%%b=!w[%%a]!
      set "w[%%a]="
   )
)) > Result.txt

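EDIT: You may want to run a second test with the set "w[%%a]=" line (and the parentheses that then become unnecessary) removed. This may make execution faster, depending on the data.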
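For comparison, the same load-one-file-then-stream-the-other idea can be sketched in Python, the language Answer 4 uses. This is only a minimal sketch, not a port of the batch file: the names merge_on_first_word and merge_files are hypothetical, lines without "=" are skipped by assumption, and unlike the batch file it does not clear a key after its first match.

```python
def merge_on_first_word(lines1, lines2):
    """Yield each line of lines1 extended with the matching remainder from lines2."""
    # Index the second file: first word -> everything after the first '='.
    lookup = {}
    for line in lines2:
        key, sep, rest = line.rstrip("\r\n").partition("=")
        if sep:  # assumption: lines without '=' are ignored
            lookup[key] = rest

    # Stream the first file; one dictionary lookup per line, no nested loop.
    for line in lines1:
        line = line.rstrip("\r\n")
        key, sep, _ = line.partition("=")
        if sep and key in lookup:
            yield line + "=" + lookup[key]


def merge_files(path1, path2, out_path):
    """Hypothetical helper: apply merge_on_first_word to files on disk."""
    with open(path1) as f1, open(path2) as f2, open(out_path, "w") as out:
        for merged in merge_on_first_word(f1, f2):
            out.write(merged + "\n")
```

Calling merge_files("File1.txt", "File2.txt", "Result.txt") would write the merged lines in File1.txt order. Because every File1 line costs a single dictionary lookup, the work grows with the sum of the two file sizes rather than their product.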

Answer 2

Try this AutoIT code:

#include <array.au3>

;The handle of the result file (result.txt)
Local $hResultFile = FileOpen(@ScriptDir&"\result.txt", 1)

;Putting each file into memory
$hfile1 = FileOpen(@ScriptDir&"\file1.txt")
$hfile2 = FileOpen(@ScriptDir&"\file2.txt")
$file1 = FileRead($hFile1)
$file2 = FileRead($hFile2)

;Creating an array with eachline of the files

$aFile1 = StringSplit($File1,@LF)
$aFile2 = StringSplit($File2,@LF)

;looping in the array to test each line

For $i = 1 to UBound($aFile1)-1
;Getting the name to test in File2.txt
   $TestName = stringsplit($aFile1[$i],"=")[1]
;Testing if the name exists in File2 and getting the index of the found string in $aFile2
   $Index =_ArraySearch($aFile2,$TestName,-1,-1,-1,-1,3)
;If a matching name is found then writing the new line in result.txt
   if not @error Then
      $Line=StringReplace($aFile1[$i] & "=" & StringSplit($aFile2[$Index],"=")[2],@CR,"")
      FileWriteLine($hResultFile,$Line&@CRLF)
   EndIf
Next
FileClose($hResultFile)
FileClose($hFile1)
FileClose($hFile2)

The output file is result.txt.

Answer 3

OK, here's my solution:

#m::    ; the hotkey
TempFile := "C:\copy\temp.txt"
NewFile := "C:\copy\new.txt"
Run, %comspec% /c copy /Y C:\copy\File1.txt+C:\copy\File2.txt  %TempFile% ; concatenate File1.txt and File2.txt into the temp file
Loop ;waits for file
{
IfExist, %TempFile%
    Break
Sleep 1000
}
FileRead, Contents, %TempFile%
if not ErrorLevel  ; Successfully loaded.
{
    Sort, Contents
}
Contents := RegExReplace(Contents, "(\r\n|^)([^=]+?)(=[^\r\n]+?)\r\n\2(=[^\r\n]+?(\r\n|$|))", "$1_$2$3$4") ; finds Matching names
Contents := RegExReplace(Contents, "(\r\n|^)[^_][^\r\n]+?(\r\n|$)", "$1$2") ;removes not matching rows
Contents := RegExReplace(Contents, "(\r\n|^)_", "$1") ;removes underscore
Contents := RegExReplace(Contents, "\s\s+", "`r`n") ; removes empty rows
Contents := RegExReplace(Contents, "\x1A", "") ; after the merge a stray symbol is left over (presumably the Ctrl+Z end-of-file character that copy appends), so remove it
Contents := RegExReplace(Contents, "^\s+", "") ; remove first empty row
FileAppend, %Contents%, %NewFile%
Contents =  ; Free the memory.
FileDelete, %TempFile%
return

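There are some issues here, such as:

File1 and File2 must end with an empty last line
Only the second row with the same name is merged; a third row will be ignored (this can be fixed by putting the first RegExReplace in a Loop and checking whether anything matched)
Not sure in which order the rows are merged (File1 row then File2 row, or File2 row then File1 row)
If File1 or File2 contains the same name twice, those rows will be merged.

It seems to work fast.....

Here is a slightly changed script that takes the rows from File1 and appends the File2 rows: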

#m::    ; the hotkey

NewFile := "C:\copy\new.txt"

FileRead, Contents1, C:\copy\File1.txt
Contents1 := RegExReplace(Contents1, "(\r\n|^)([^=]+?)=", "$1$21=") 

FileRead, Contents2, C:\copy\File2.txt
Contents2 := RegExReplace(Contents2, "(\r\n|^)([^=]+?)=", "$1$22=") 

Contents := Contents1 . "`r`n" . Contents2 . "`r`n"

Contents1 =
Contents2 = 

if not ErrorLevel  ; Successfully loaded.
{
    Sort, Contents
}

Contents := RegExReplace(Contents, "(\r\n|^)([^1=]+?)1(=[^\r\n]+?)\r\n\2(2)", "$1_$2$3") ; finds Matching names
Contents := RegExReplace(Contents, "(\r\n|^)[^_][^\r\n]+?(\r\n|$)", "$1$2") ;removes not matching rows
Contents := RegExReplace(Contents, "(\r\n|^)_", "$1") ;removes underscore
Contents := RegExReplace(Contents, "\s\s+", "`r`n") ; removes empty rows
Contents := RegExReplace(Contents, "^\s+", "") ; remove first empty row
FileAppend, %Contents%, %NewFile%
Contents =  ; Free the memory.

return

Answer 4

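A question: do you know whether your files are sorted by the first word? If they are, you can really speed things up by walking through both files with two pointers, mergesort-style. (But you probably know that already? :))

I have to admit I'm not an AHK pro, so my answer is in Python. Worse still, I couldn't test it on files as large as yours.

But I thought I'd give it a try anyway, and it may even help someone, so here is my solution: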

f1 = 'File1.txt'
f2 = 'File2.txt'
f3 = 'File3.txt'
with open(f1, 'r') as file1, open(f2, 'r') as file2, open(f3, 'w') as file3:
    words1 = dict()
    words2 = dict()
    for line in file1:
        k,e = line.split('=',1)
        e = e.rstrip('\n')
        if k in words1:
            words1[k] += '=' + e
        else:
            words1[k] = '=' + e

    for line in file2:
        k,e = line.split('=',1)
        e = e.rstrip('\n')
        if k in words2:
            words2[k] += '=' + e
        else:
            words2[k] = '=' + e

    words = set(words1).intersection(set(words2))

    for w in words:
        file3.write(w + words1[w] + words2[w] + '\n')

You may need to check whether '\n' really is your line ending, but that's just some os.* stuff..
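The two-pointer idea from the top of this answer could look roughly like the sketch below. It assumes both inputs are already sorted by the first word using plain string comparison (sorting whole lines is not quite the same, because "=" takes part in the comparison), and the name merge_sorted is made up for illustration.

```python
def merge_sorted(lines1, lines2):
    """Merge-join two iterables of 'key=rest' lines, both sorted by key."""
    def records(lines):
        # Split each line into (key, rest); lines without '=' are skipped.
        for line in lines:
            key, sep, rest = line.rstrip("\r\n").partition("=")
            if sep:
                yield key, rest

    it1, it2 = records(lines1), records(lines2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None and r2 is not None:
        if r1[0] < r2[0]:
            r1 = next(it1, None)   # key only in the first file: skip it
        elif r1[0] > r2[0]:
            r2 = next(it2, None)   # key only in the second file: skip it
        else:
            yield r1[0] + "=" + r1[1] + "=" + r2[1]
            # Advance only the first file, so repeated keys there still match.
            r1 = next(it1, None)
```

If the files are not sorted yet, something like sorted(lines, key=lambda l: l.partition("=")[0]) would prepare each input, at the cost of holding it in memory; in that case the dictionary approach above is probably simpler.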

Answer 5

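SachaDee, thank you very much for taking the time to reply.

The good news is that the script you provided works perfectly, straight out of the box. Plus, I learned a bit of AutoIt.

The bad news is that it is very slow with the large files I'm dealing with. In fact, I tried my old script, which has a nested loop, and I also wrote a variant of it that uses StringSplit and a simple loop instead of two parsing loops.

Here are the results:

With 30,000 records to compare, producing a 3.5 MB text file as output:

Original script: about 18 minutes

My new script: about 14 minutes

AutoIt script: about 31 minutes

So they all produce the same output, but the difference in time taken is huge! The AutoIt script starts out very fast, at times writing a lot of bytes faster than the AHK ones... but it gets stuck in the middle of the file, where there is a lot of repeated text, and becomes very slow, whereas both AHK scripts keep going at the same pace.

What puzzles me is that I've seen someone manage to merge the exact same files in about 1.5 minutes! No idea what he is using, but it looks like voodoo or some kind of black magic :)

I'm posting the fastest one I have so far, so people may want to try it. For smaller files it can be very fast: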

#SingleInstance Force
#MaxThreads 255
#MaxMem 4095
#MaxThreadsBuffer On
#ListLines Off
SetBatchLines, -1
SetKeyDelay, -1

FileRead, XMLRead, .\GeneratedFiles\Text1.ini
StringSplit, XMLLines, XMLRead, `n
FileRead, CVRead, .\GeneratedFiles\Text2.ini
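; Iterate once per line of the first file (StringSplit stores the line count in XMLLines0)
Loop, %XMLLines0%
{
    GameLine := XMLLines%A_Index%
    If InStr(GameLine, "=")
    {
        StringSplit, GameInfo, GameLine, "=",

        ; Scan the whole second file for a line with the same first field
        Loop, parse, CVRead, `n
        {
            StringSplit, CVReadB, A_LoopField, "=",
            If (GameInfo1 = CVReadB1)
            {
                StringTrimRight, GameLine, GameLine, 1
                MergeResult = %GameLine%=%CVReadB2%

                FileAppend, %MergeResult%, MergeNPCV.ini
            }
        }
    }
}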

Answer 6

Another approach using AutoHotkey (AHK_L); see whether it works for you:

SetBatchLines, -1
FileRead, file1, .\GeneratedFiles\Text1.ini
FileRead, file2, .\GeneratedFiles\Text2.ini
f2 := []
loop, parse, File2, `n, `r
    x:=StrSplit(A_LoopField, "="), f2[x.1] := x.2
loop, parse, File1, `n, `r
    if f2[v:=StrSplit(A_LoopField, "=").1]
        R .= A_LoopField "=" f2[v] "`r`n"
f2 := []
FileAppend, % Trim(R, "`r`n"), MergeNPCV.ini

Merge/append txt files line by line when a common word is found in both files
