Question

How can I completely remove duplicated lines with Linux tools such as grep, sort, sed, or uniq?

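This question is hard to put into words, because I can't see anything that would give it more meaning. But the example is straightforward. If I have a file like this: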

After processing the file to delete the duplicated lines, it becomes:

1
3
4

I know some Python, and this is a Python script I wrote to do it. Create a file named clean_duplicates.py and run it as:

import sys

#
# To run it use:
# python clean_duplicates.py < input.txt > clean.txt
#
def main():

    lines = sys.stdin.readlines()

    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort them
# case-sensitively before running it.
# 
def clean_duplicates( lines ):

    # An empty input has nothing to print (and lines[ 0 ] would fail)
    if not lines:

        return

    lastLine    = lines[ 0 ]
    nextLine    = None
    currentLine = None
    linesCount  = len( lines )

    # If it is a one-line file, print it and stop
    if linesCount == 1:

        sys.stdout.write( lines[ 0 ] )
        return

    # To print the first line
    if linesCount > 1 and lines[ 0 ] != lines[ 1 ]:

        sys.stdout.write( lines[ 0 ] )

    # To print the middle lines; note that range( 0, 2 ) yields 0 and 1
    for index in range( 1, linesCount - 1 ):

        currentLine = lines[ index ]
        nextLine    = lines[ index + 1 ]

        if currentLine == lastLine:

            continue

        lastLine = lines[ index ]

        if currentLine == nextLine:

            continue

        sys.stdout.write( currentLine )

    # To print the last line (the check is > 1 so a two-line input is covered too)
    if linesCount > 1 and lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:

        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":

    main()

That said, removing duplicate lines seems like it should be easier with tools such as grep, sort, sed, or uniq.

Answer 1

You can use uniq with the -u / --unique option. According to the uniq man page:

-u / --unique

Don't output lines that are repeated in the input. Print only lines that are unique in the INPUT.

For example:

cat /tmp/uniques.txt | uniq -u

Or, as explained in "UUOC: Useless use of cat", a better way is:

uniq -u /tmp/uniques.txt

Both commands return:

1
3
4

Here /tmp/uniques.txt contains the numbers mentioned in the question.

Note: uniq requires the file content to be sorted. As the documentation puts it:

By default, uniq prints the unique lines in a sorted file: it discards all but one of identical successive input lines, so that the OUTPUT contains unique lines.

If the file is not sorted, you need to sort the content first and then run uniq on the sorted content:

sort /tmp/uniques.txt | uniq -u
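
To make the role of sorting concrete, here is a minimal Python sketch of what uniq -u appears to do: it looks only at runs of adjacent identical lines and keeps a line when its run has length 1. The function name uniq_u and the sample list are made up for illustration (the question's input listing is not shown above), and this is an approximation of the behaviour, not the tool's actual implementation.

```python
from itertools import groupby


def uniq_u(lines):
    """Keep only lines whose run of adjacent identical lines has length 1."""
    kept = []
    for line, run in groupby(lines):
        if len(list(run)) == 1:
            kept.append(line)
    return kept


# Hypothetical sample data, not taken from the question.
sample = ["2", "1", "2", "3", "4"]

# Unsorted: the two "2" lines are not adjacent, so both survive.
print(uniq_u(sample))

# Sorted first: the "2" lines form one run and are dropped entirely.
print(uniq_u(sorted(sample)))
```

The second call corresponds to sort file | uniq -u, which is why the sort step matters whenever duplicates are not already next to each other.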

Answer 2

No sorting is needed, and the output order is the same as the input order:

$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' file file
1
3
4
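
The awk one-liner reads the file twice: the first pass counts every line, the second prints only lines whose count is 1. The same two-pass idea can be sketched in Python as follows. The function name drop_all_duplicates and the sample list are illustrative assumptions, not part of the answer.

```python
from collections import Counter


def drop_all_duplicates(lines):
    """Return the lines that occur exactly once, in their original order."""
    counts = Counter(lines)  # first pass: count every line
    return [line for line in lines if counts[line] == 1]  # second pass: filter


# Hypothetical unsorted sample; "2" appears twice and is removed completely.
sample = ["4", "2", "1", "2", "3"]
print(drop_all_duplicates(sample))
```

Because nothing is sorted, the surviving lines keep the order they had in the input, at the cost of holding the counts in memory.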

Answer 3

Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20

If you have lines of this kind, you can use this command.

[isuru@192 ~]$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'

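Be careful with special characters, though. If your lines contain dashes, make sure to use a different symbol. Note that I left a space between the backslash and the forward slash.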
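
The caveat about dashes can be seen in a small Python sketch that mirrors the two sed substitutions (spaces to dashes, then dashes back to spaces). The function name dash_round_trip and the second sample line are hypothetical, chosen only to show a line that already contains a dash.

```python
def dash_round_trip(line):
    """Mimic sed 's/\\ /\\-/g' followed by sed 's/\\-/\\ /g' on one line."""
    return line.replace(" ", "-").replace("-", " ")


plain = "Europe Hungary Beverages Online"
dashed = "Europe Hungary Non-Alcoholic Online"  # hypothetical line with a dash

print(dash_round_trip(plain))   # comes back unchanged
print(dash_round_trip(dashed))  # the original dash is turned into a space
```

A line without dashes survives the round trip, while a dash that was already in the data is silently converted to a space, which is why a placeholder character that never occurs in the file is the safer choice.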

Answer 4

Use the sort command with the -u option to list the unique values of any command's output.

    cat file_name |sort -u
1
2
3
4
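
As the output above shows, sort -u keeps one copy of each repeated line (the 2 is still present) instead of removing it altogether. A rough Python equivalent is sketched below. The function name sort_u and the sample list are assumptions for illustration, and plain string ordering only approximates the locale-aware ordering that sort uses.

```python
def sort_u(lines):
    """Return the distinct lines in sorted order, one copy of each."""
    return sorted(set(lines))


# Hypothetical sample with "2" repeated.
sample = ["1", "2", "2", "3", "4"]
print(sort_u(sample))
```

This is useful when one representative of each line is wanted. When every repeated line should disappear, the uniq -u and awk approaches from the other answers are the better fit.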
