Question

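I have a text file containing several function blocks, some of which are duplicates. I want to create a new file that contains only the unique function blocks. For example, input.txt (I have updated the example):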

Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a2,b2) abc2
{
xyz2;
    {
        xy2;
        rst2;
    }

xy2;
}

Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a3,b3) abc3
{
xyz3;
    {
        xy3;
        rst3;
        def3;
    }

xy3;
}
    Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

and I would like output.txt to be:

Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a2,b2) abc2
{
xyz2;
    {
        xy2;
        rst2;
    }

xy2;
}

Func (a3,b3) abc3
{
xyz3;
    {
        xy3;
        rst3;
        def3;
    }

xy3;
}

I found a solution that uses awk to remove duplicate lines, for example:

$ awk '!a[$0]++' input.txt > output.txt

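The problem is that the solution above matches only single lines, not blocks of text. I wanted to combine this awk solution with a regular expression that matches a single function block: '/^FUNC(.|\n)*?\n}/'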

But I could not get this to work. Any suggestion or solution would be very helpful.

Answer 1

$ awk '$1=="Func"{ f=!seen[$NF]++ } f' file
Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a2,b2) abc2
{
xyz2;
    {
        xy2;
        rst2;
    }

xy2;
}

Func (a3,b3) abc3
{
xyz3;
    {
        xy3;
        rst3;
        def3;
    }

xy3;
}

The above assumes that each Func definition is on its own line and that the line ends with the function name.

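All the above does is look for "Func" lines and set a flag f to true if this is the first time we have seen the function name at the end of the line, and to false otherwise (using the common awk idiom !seen[$NF]++, which you were already using in your question, except that you named the array a[]). It then prints the current line if f is true (i.e. you are following a Func definition of a previously unseen function name) and skips it otherwise (i.e. you are following a Func definition of a function name that has been seen before).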
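For readers more comfortable with Python (the language the asker ends up using), the flag logic described above can be sketched as follows. This is only an illustration of the idea, not part of the original answer; the function name `unique_blocks_by_name` and the shortened sample data are made up for the example.

```python
def unique_blocks_by_name(lines):
    """Keep only the lines that follow the first Func header seen for each name."""
    seen = set()
    keep = False  # plays the role of the awk flag f (unset, hence false, at the start)
    out = []
    for line in lines:
        fields = line.split()  # whitespace-separated fields, like awk's $1..$NF
        if fields and fields[0] == "Func":
            name = fields[-1]         # awk's $NF: the last field on the line
            keep = name not in seen   # true only the first time the name is seen
            seen.add(name)
        if keep:
            out.append(line)
    return out


# Hypothetical, shortened sample: the third block repeats abc1 with an indented header.
sample = [
    "Func (a1,b1) abc1", "{", "xyz1;", "}", "",
    "Func (a2,b2) abc2", "{", "xyz2;", "}", "",
    "    Func (a1,b1) abc1", "{", "xyz1;", "}",
]
result = unique_blocks_by_name(sample)
print("\n".join(result))
```

Because the comparison uses only the last field of the header line, the indented duplicate header is still recognized. The flip side is that two different bodies under the same function name would also be treated as duplicates.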

Answer 2

If your code blocks are separated by blank lines, you can set the record separator (and the output record separator) accordingly...

$ awk -v RS= -v ORS='\n\n' '!a[$0]++' input.txt > output.txt

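NB. This works for the toy example, but it is fragile: any blank line inside a code block breaks the logic. Similarly, you cannot rely on the closing brace alone, since braces can also appear inside a code block.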
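To make the fragility concrete, here is a rough Python sketch of the same paragraph-based idea. It approximates awk's paragraph mode by splitting on runs of empty lines; the function name `unique_paragraphs` and the two sample strings are hypothetical.

```python
import re


def unique_paragraphs(text):
    """Split text on runs of empty lines and drop repeated paragraphs, keeping order."""
    paragraphs = [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]
    # dict.fromkeys keeps the first occurrence of each paragraph, in original order
    return list(dict.fromkeys(paragraphs))


# Blocks without inner blank lines: the duplicate is removed as intended.
simple = (
    "Func (a1,b1) abc1\n{\nxyz1;\n}\n\n"
    "Func (a2,b2) abc2\n{\nxyz2;\n}\n\n"
    "Func (a1,b1) abc1\n{\nxyz1;\n}\n"
)
simple_result = unique_paragraphs(simple)

# A single block containing a blank line is split into two unrelated pieces.
with_blank = "Func (a1,b1) abc1\n{\nxyz1;\n\nxy1;\n}\n"
broken_result = unique_paragraphs(with_blank)

print(len(simple_result), len(broken_result))
```

In the second case the block is cut at its inner blank line, so the pieces are deduplicated independently of the function they belong to, which is the failure mode the note above warns about.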

Update

For the updated input, this may work better:

$ awk -v ORS='\n\n' '{record=($1~/^Func/)?$0:record RS $0}
    /^}/ && !a[record]++{print record} ' input.txt

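Here a record is defined as starting at a line whose first field begins with the keyword "Func" and ending at a closing brace in the first column. The script accumulates the lines of the record and prints it once it is complete. ORS is set so that there is a blank line between records.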
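The record-accumulation logic just described can be sketched in Python as follows. This is an illustrative translation rather than the answerer's code; `unique_func_records` and the sample blocks are assumptions made for the example.

```python
def unique_func_records(lines):
    """Collect records running from a Func header to a '}' in column one; keep unique ones."""
    seen = set()
    out = []
    record = ""
    for line in lines:
        fields = line.split()
        if fields and fields[0].startswith("Func"):
            record = line                    # a Func header starts a new record
        else:
            record = record + "\n" + line    # any other line extends the current record
        # A brace in the first column closes the record; emit it if it is new.
        if line.startswith("}") and record not in seen:
            seen.add(record)
            out.append(record)
    return out


block1 = ["Func (a1,b1) abc1", "{", "xyz1;", "    {", "        xy1;", "    }", "", "xy1;", "}"]
block2 = ["Func (a2,b2) abc2", "{", "xyz2;", "    {", "        xy2;", "    }", "", "xy2;", "}"]
lines = block1 + [""] + block2 + [""] + block1
records = unique_func_records(lines)
print("\n\n".join(records))
```

Inner blank lines and indented braces no longer end a record, because only a brace at the start of a line does. Note that records are compared verbatim, so a duplicate whose header carries different leading whitespace would count as a different block in this sketch.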

Answer 3

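Since the OP changed the requirements and the sample, I have rewritten the code. Please try the following (it reads Input_file twice) and let me know whether it helps.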

awk 'FNR==NR && /Func/ && !a[$0]++{gsub(/^ +/,"");!b[$0]++;next} FNR!=NR && /Func/{flag=($0 in b)?1:"";delete b[$0]} flag'  Input_file  Input_file

Adding a non-one-liner form of the solution as well.

awk '
FNR==NR && /Func/ && !a[$0]++{
  gsub(/^ +/,"");
  !b[$0]++;
  next}
FNR!=NR && /Func/{
  flag=($0 in b)?1:"";
  delete b[$0]}
flag
'   Input_file  Input_file

Answer 4

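Adapt this code to your real purpose (I do not know the exact protocol and format of the language in the sample). The code is self-commented.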


Answer 5

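Thanks to everyone for the solutions. They are correct for the example I posted, but my actual task is more general. I found a generic solution in Python, since the answers above did not work perfectly for me (possibly because of my limited knowledge of bash). My generic solution in Python is as follows: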

import os
import re

testFolder = "./Path"

# Usage: remove duplicate function blocks from one or more .txt files in testFolder

# Pattern for one function block: from a line starting with FUNC up to the
# first "}" at the start of a line. The sample in the question spells the
# keyword "Func"; adjust the keyword to match your data.
pattern = re.compile(r'^FUNC(?:.|\n)*?\n}', re.M)

# Iterate over all the files available
for testFilePath in os.listdir(testFolder):
    if testFilePath.endswith(".txt"):
        # Create a "Reduced" folder in the output path
        outputPath = os.path.join(testFolder, "Reduced")
        os.makedirs(outputPath, exist_ok=True)

        # Read all the content into a single string
        with open(os.path.join(testFolder, testFilePath), "r") as inputFile:
            fileContent = inputFile.read()

        # Create a list of function blocks, in file order
        funcList = pattern.findall(fileContent)
        # Remove duplicates; dict.fromkeys keeps the first occurrence and the order
        uniqueFuncList = list(dict.fromkeys(funcList))

        # Write each function block to the output file, separated by a blank line
        with open(os.path.join(outputPath, testFilePath), "w") as outputFile:
            for element in uniqueFuncList:
                outputFile.write(element + "\n\n")

Remove duplicate function blocks using 'awk' / Python (generic solution)

5 answers: