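I have a file containing many lines of data, some of which are duplicates, with the date field at the end of each record. I want to scan the file and keep only the most recent record. This is what the data looks like: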
00xbdf0c9fd6;joe@easy.us.com;20141231 <- remove this one
00vbdf0c9fd6;joe@easy.us.com;20150403 <- keep this one (newer date)
00dndf0ca080;betty@easy.us.com;20141231 <-keep
00dbkf0ca292;jerry@easy.us.com;20141231 <-keep
0dbds0ca2f6;john@easy.us.com;20141231 <- remove
0dbds0ca2f6;john@easy.us.com;20150403 <- keep (newer date)
I have tried sed, awk, and grep in various flavors and combinations, but I can't get it to work.
Answer 0 (score: 0)
Try this:
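{
    # Split the record on ";": parts[1] = id, parts[2] = address, parts[3] = date
    split($0, parts, /;/)
    # Remember the newest date seen for each address (an unset entry compares as less)
    if (link[parts[2]] < parts[3]) {
        link[parts[2]] = parts[3]
    }
}
END {
    # Print each address with its newest date
    for (l in link) {
        print l, link[l]
    }
}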
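This produces the following (only the address and its newest date are printed, in no particular order):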
sue@easy.us.com 20141231
jerry@easy.us.com 20141231
joe@easy.us.com 20150403
betty@easy.us.com 20141231
john@easy.us.com 20150403
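The script above prints only the address and the date, so the ID in the first field is lost. If the whole record should be kept, a minimal sketch along the same lines could store the full line next to the newest date. The function name `keep_latest` and the array names `newest` and `record` are made up for illustration, and the sketch assumes the dates are always eight-digit YYYYMMDD values and that the real file does not contain the `<- keep` annotations shown in the question.

```shell
#!/bin/sh
# keep_latest: hypothetical helper, not part of the original answer.
# Reads id;address;date records from the named files (or stdin) and
# prints, for each address, the full record with the newest date.
keep_latest() {
    awk -F';' '
        # Store the record if the address is new or its date is newer.
        # Fixed-width YYYYMMDD dates order the same way whether they are
        # compared as numbers or as strings.
        !($2 in newest) || $3 > newest[$2] {
            newest[$2] = $3
            record[$2] = $0
        }
        END {
            # for-in order is unspecified, so the output is sorted afterwards.
            for (addr in record) print record[addr]
        }
    ' "$@" | sort -t';' -k2,2
}
```

Called as `keep_latest infile`, this prints one full line per address, ordered by address.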
Answer 1 (score: 0)
Why not sort the file by address and by descending timestamp? Then all you need to do is keep the first record for each address:
<infile sort -t\; -k2,2 -k3r | awk -F\; '!h[$2]++'
Output:
00dndf0ca080;betty@easy.us.com;20141231
00dbkf0ca292;jerry@easy.us.com;20141231
00vbdf0c9fd6;joe@easy.us.com;20150403
0dbds0ca2f6;john@easy.us.com;20150403
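The awk filter `!h[$2]++` in the pipeline is terse, so here is a sketch of what it does, written out in long form. The function name `first_per_address` and the array name `seen` are hypothetical. It assumes the input is already sorted so that the newest record for each address comes first, as the `sort` call arranges.

```shell
#!/bin/sh
# first_per_address: hypothetical long form of awk -F';' '!h[$2]++'.
# Prints a record only the first time its address (field 2) appears.
first_per_address() {
    awk -F';' '
        {
            # An unset counter compares equal to 0, so this is true
            # only on the first occurrence of the address.
            if (seen[$2] == 0) print
            seen[$2]++
        }
    ' "$@"
}
```

It can replace the awk stage directly, for example `sort -t';' -k2,2 -k3r infile | first_per_address`.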