Unix: remove duplicate rows from a CSV based on 2 columns

Time: 2017-01-10 12:19:33

Tags: csv unix duplicates

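I have a CSV file with several columns. Some rows may have duplicate values in column 4 (col4).

When duplicates occur, I need to delete the entire duplicated rows and keep only one of them. The row to keep is the one with the highest value in col1.

Here is an example: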

col1,col2,col3,col4 

1,x,a,123

2,y,b,123

3,y,b,123

1,z,c,999

Rows 1, 2 and 3 are duplicates of each other; only row 3 is kept, because col1(row3) > col1(row2) > col1(row1).

Right now this code removes the duplicates in col4 without looking at col1:

awk -F, '!seen[$4]++' myfile.csv

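I want to add a condition that checks col1 for each group of duplicates, removes the rows with the smaller col1 values, and keeps the row with the highest value in col1.

The output should be: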

col1,col2,col3,col4

3,y,b,123

1,z,c,999

Thanks!

1 answer:

Answer 0: (score: 0)

@Mr Smith: could you please try the following and let me know if it helps you.

awk -F"[[:space:]]+,[[:space:]]+"  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}'   Input_file  Input_file

EDIT: Try:

awk -F","  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file   Input_file
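As a quick check, here is a minimal sketch that runs the command above against the sample data from the question. The temporary file and the variable names sample and result are made up for this demo and are not part of the original post.

```shell
# Sample data from the question, written to a throwaway temp file (hypothetical name).
sample=$(mktemp)
cat > "$sample" <<'EOF'
col1,col2,col3,col4
1,x,a,123
2,y,b,123
3,y,b,123
1,z,c,999
EOF

# Pass 1 (FNR==NR): remember the largest col1 seen for each value of the last field.
# Pass 2: print only the lines whose col1 equals that stored maximum.
result=$(awk -F"," 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this sample the command should print the header line followed by 3,y,b,123 and 1,z,c,999, which is the output asked for in the question. Note that two rows sharing the same highest col1 for a given col4 would both be printed.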

EDIT2: Here is an explanation, as requested by the OP:
awk -F","                               ##### Start awk and set the field delimiter to a comma (,).
'FNR==NR{                               ##### FNR==NR is TRUE only while Input_file is being read for the first time.
                                              In this pass we save, in array A indexed by the last field, the highest $1 seen so far.
                                              NR and FNR are built-in awk variables. Both count lines; the only difference is that
                                              FNR is RESET each time a new input file starts being read,
                                              whereas NR keeps increasing until all input files have been read. So this condition is
                                              TRUE only while the first Input_file is being read.
A[$NF]=                                 ##### Assign to the element of array A whose index is $NF (the last field of the current line) the result of the test below.
$1>A[$NF]                               ##### The condition: is the current line's $1 greater than the value stored in A[$NF]? (Of course, only lines with the
                                              same last field $NF are compared with each other.)
?                                       ##### ? is the ternary (conditional) operator: if the condition is TRUE, the expression that follows it is used,
$1                                      ##### which sets A[$NF] to $1 (because, as per your requirement, we need the HIGHEST value).
:                                       ##### If the condition explained two lines above is FALSE, the expression after : is used instead,
A[$NF];                                 ##### which keeps the value of A[$NF] as it is, with no change.
next}                                   ##### next is a built-in awk statement: it skips all further statements for the current line and starts again from the
                                              very first statement with the next line. It is used here so that the statements below do not run while Input_file is being read the first time.
(($NF) in A) && $1 == A[$NF] && A[$NF]{ ##### These conditions are evaluated only while Input_file is being read the second time. They check that
                                              $NF (the last field of the current line) is an index of array A, that the stored value equals the first field, and that the stored value is neither empty nor zero.
print                                   ##### If all of the conditions above are TRUE, print the current line of Input_file.
}' Input_file   Input_file              ##### The same Input_file is named twice so that awk reads it twice.
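
To make the FNR/NR part of the explanation concrete, here is a small sketch that prints both counters while awk reads two files. The two temporary files, their contents, and the labels first-file and later-file are invented for this demo only.

```shell
# Two small throwaway files (hypothetical contents, created only for this demo).
f1=$(mktemp)
f2=$(mktemp)
printf 'a\nb\n' > "$f1"
printf 'c\nd\n' > "$f2"

# For every line print NR, FNR, the line itself, and whether FNR==NR holds.
# NR counts lines across all files; FNR restarts at 1 for each new file.
out=$(awk '{ print NR, FNR, $0, (FNR==NR ? "first-file" : "later-file") }' "$f1" "$f2")

printf '%s\n' "$out"
rm -f "$f1" "$f2"
```

This should show NR running from 1 to 4 while FNR goes 1, 2, 1, 2, so FNR==NR is true only for the lines of the first file. One caveat: if the first file is empty, NR is still 0 when the second file starts, so FNR==NR would also be true there. That does not matter in the answer above, since the same file is passed twice.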