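Uniq on only part of a line

Time: 2019-03-01 16:26:14

Tags: email awk uniq

I am trying to merge email lists, but I want to uniq (or uniq -i -u) on the email address rather than on the whole line including the name, so that we don't end up with duplicates.

List 1: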

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>

List 2:

firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>

The current output is

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>

The desired output would be

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>

(companyb@companyb.com is listed in both)

How can I do this?

5 answers:

Answer 0 (score: 4)

Given the specified file format

$ awk -F'[<>]' '!a[$2]++' files

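prints the first line seen for each distinct value between the angle brackets. Alternatively, if nothing follows the email address on the line, there is no need to split on the angle brackets: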

$ awk '!a[$NF]++' files

This can also be done with sort:

$ sort -t'<' -k2,2 -u files

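which sorts the output as a side effect, whether you want that or not.

N.B. Both alternatives assume that angle brackets appear nowhere other than around the email address.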
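
To try the first command on the question's sample data, here is a minimal sketch. The file names list1 and list2 and the scratch directory are assumptions made for illustration. The tolower() call and the capitalised CompanyB@ address are additions, not part of the answer; they show one way to get a case-insensitive comparison in the spirit of uniq -i.

```shell
# Recreate the two sample lists in a scratch directory (file names are hypothetical).
tmp=$(mktemp -d)
cat > "$tmp/list1" <<'EOF'
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
EOF
# Assumption: the duplicate address is capitalised differently here to exercise tolower().
cat > "$tmp/list2" <<'EOF'
firstname lastname <firstname@gmail.com>
Fake Person <CompanyB@companyb.com>
Joe lastnanme <joe@gmail.com>
EOF

# Split on < or >, so $2 is the address; keep a line only the first time
# its lowercased address is seen.
deduped=$(awk -F'[<>]' '!seen[tolower($2)]++' "$tmp/list1" "$tmp/list2")
printf '%s\n' "$deduped"
rm -rf "$tmp"
```

Because list1 is read first, its "Company B" line is the one that survives; swapping the file order would keep "Fake Person" instead.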

Answer 1 (score: 3)

Here is one in awk:

$ awk '
match($0,/[a-z0-9.]+@[a-z.]+/) {      # look for emailish string *
    a[substr($0,RSTART,RLENGTH)]=$0   # and hash the record using the address as key
}
END {                                 # after all are processed
    for(i in a)                       # output them in no particular order
        print a[i]
}' file2 file1                        # switch order to see how it affects output

Output:

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
Joe lastnanme <joe@gmail.com>
firstname lastname <firstname@gmail.com>

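The script looks for a very simple email-like string (* see the regex in the script and tune it to your liking) and uses it as the key under which the whole record is hashed. Since earlier instances are overwritten, the last instance wins.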
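
If "last instance wins" is what you want but the unordered for (i in a) output is not, one possible variation (not from the answer) is to remember the order in which addresses first appear. The array names rec and order, the counter n, and the shortened sample files are all hypothetical.

```shell
# Shortened sample data in a scratch directory (file names are hypothetical).
tmp=$(mktemp -d)
printf '%s\n' 'Company B <companyb@companyb.com>' 'Joe lastnanme <joe@gmail.com>' > "$tmp/list1"
printf '%s\n' 'Fake Person <companyb@companyb.com>' > "$tmp/list2"

last_wins=$(awk '
match($0, /[a-z0-9.]+@[a-z.]+/) {
    key = substr($0, RSTART, RLENGTH)
    if (!(key in rec))        # first time this address appears:
        order[++n] = key      # remember its position
    rec[key] = $0             # later records overwrite earlier ones
}
END {
    for (i = 1; i <= n; i++)  # replay in first-seen order
        print rec[order[i]]
}' "$tmp/list1" "$tmp/list2")
printf '%s\n' "$last_wins"
rm -rf "$tmp"
```

Here the companyb address keeps its original position in the output, but the line printed for it is the later "Fake Person" record.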

Answer 2 (score: 3)

uniq has a -f option that skips a given number of blank-separated fields, so we can sort on the third field and then ignore the first two:

$ sort -k 3,3 infile | uniq -f 2
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>

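This is not very robust, though: it breaks as soon as there are not exactly two fields before the email address, because sort will key on the wrong field and uniq will compare the wrong fields.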

See karakfa's answer for a way that does not even need uniq here.

Alternatively, check only the last field for uniqueness:

awk '!e[$NF] {print; ++e[$NF]}' infile

or, even shorter, awk '!e[$NF]++' infile
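
The fragility of the field-position approach can be seen with a small made-up input in which the two names have different word counts. The addresses and the variable names by_position and by_last_field are hypothetical; the sketch only counts the surviving lines.

```shell
# Hypothetical input: the first name is one word, the second is two words.
tmp=$(mktemp -d)
cat > "$tmp/infile" <<'EOF'
Alice <alice@example.com>
Bob Smith <alice@example.com>
EOF

# Field-position approach: the address is field 2 on one line and field 3
# on the other, so uniq -f 2 compares different things and keeps both lines.
by_position=$(sort -k 3,3 "$tmp/infile" | uniq -f 2 | grep -c '')

# Last-field approach: $NF is the address on both lines, so one line survives.
by_last_field=$(awk '!e[$NF]++' "$tmp/infile" | grep -c '')

printf 'uniq -f 2: %s line(s), awk $NF: %s line(s)\n' "$by_position" "$by_last_field"
rm -rf "$tmp"
```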

Answer 3 (score: 2)

Could you please try the following.

awk '
{
   match($0,/<.*>/)
   val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
   a[val]=$0
   print
   next
}
!(val in a)
' list1 list2

Explanation: the same code, with an explanation of each line added.

awk '                                    ##Starting awk program here.
{                                        ##Starting BLOCK which will be executed for both of the Input_files.
   match($0,/<.*>/)                      ##Using match function of awk where giving regex to match everything from < to till >
   val=substr($0,RSTART,RLENGTH)         ##Creating variable named val whose value is substring of current line starting from RSTART to value of RLENGTH, basically matched string.
}                                        ##Closing above BLOCK here.
FNR==NR{                                 ##Checking condition FNR==NR which will be TRUE when 1st Input_file named list1 will be read.
   a[val]=$0                             ##Creating an array named a whose index is val and value is current line.
   print $0                              ##Printing current line here.
   next                                  ##next will skip all further statements from here.
}
!(val in a)                              ##Checking condition if variable val is NOT present in array a if it is NOT present then do printing of current line.
' list1 list2                            ##Mentioning Input_file names here.

The output will be as follows.

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>
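
One thing to keep in mind with this script: array a is filled only while list1 is being read, so an address repeated inside list2 is printed every time it occurs. The sketch below uses made-up data to show this, next to a variation (an assumption, not part of the answer) that records every address as it is printed. The names seen, two_file and any_file are hypothetical.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Company A <companya@companya.com>' > "$tmp/list1"
printf '%s\n' 'Joe lastnanme <joe@gmail.com>' 'Joseph <joe@gmail.com>' > "$tmp/list2"

# The answer's script: list2 lines are checked against list1 only.
two_file=$(awk '
{ match($0, /<.*>/); val = substr($0, RSTART, RLENGTH) }
FNR == NR { a[val] = $0; print; next }
!(val in a)
' "$tmp/list1" "$tmp/list2")

# Variation: record every address as it is printed, whichever file it came from.
any_file=$(awk '
{ match($0, /<.*>/); val = substr($0, RSTART, RLENGTH) }
!seen[val]++
' "$tmp/list1" "$tmp/list2")

printf '%s\n---\n%s\n' "$two_file" "$any_file"
rm -rf "$tmp"
```

With this input the first command prints both joe@gmail.com lines, while the variation prints only the first of them.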

Answer 4 (score: 0)

Maybe I didn't understand the question!
But you can try this awk:

awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2