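I have been tasked with comparing two large unsorted .csv files based on columns 1 and 3.
Each file contains roughly 200,000 records. For the output, I need to know which records, judged by columns 1 and 3, exist in the first file but not in the second. The files are comma-separated. Column 3 must be compared case-insensitively.
Sample file 1: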
"id", "name", "email", "country"
"1233", "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
Sample file 2:
"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"
Desired output file:
"5678", "natalia", "natalia@mailinator.com", "USA"
A code example would be much appreciated.
Answer 0 (score: 1)
Try:
join -v 1 -i -t, -1 1 -2 1 -o 1.2 1.3 1.4 1.5 <(awk -F, '{print $1":"$3","$0}' f1.txt | sort) <(awk -F, '{print $1":"$3","$0}' f2.txt | sort)
How it works:
1) First I build a composite key column by concatenating columns 1 and 3 and prepending the result to each line:
awk -F, '{print $1":"$3","$0}' f1.txt
awk -F, '{print $1":"$3","$0}' f2.txt
2) I sort both outputs:
awk -F, '{print $1":"$3","$0}' f1.txt | sort
awk -F, '{print $1":"$3","$0}' f2.txt | sort
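3) Then I use the join command to join on the first column (my composite key). The -v 1 option prints only the unpairable lines from file 1, -i ignores case when comparing keys, and -o drops the key column from the output again.
Output: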
"1233", "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
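Answer 1 (score: 0)
Loop over the files and load them into two arrays (or perhaps hashes), then loop over the second file, splitting each line into fields. If array1[n] and array2[n] are not found in the current line's array, output the record as missing. I would use Perl for this task.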
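The hash-based idea above can be sketched as follows. This is a minimal illustration in Python rather than Perl, and it assumes each file has one header line and that fields are quoted with a space after each comma, as in the samples. The function name rows_only_in_first is made up for this example.

```python
import csv


def key(row):
    # Composite key: column 1 as-is, column 3 lowercased so the
    # comparison on column 3 is case-insensitive.
    return (row[0], row[2].lower())


def rows_only_in_first(path1, path2):
    """Return the data rows of path1 whose (col 1, col 3) key is absent from path2."""
    # Pass 1: collect every key from the second file into a set (hash lookup).
    with open(path2, newline="") as f2:
        reader = csv.reader(f2, skipinitialspace=True)
        next(reader, None)  # assumption: the first line is a header
        keys2 = {key(row) for row in reader if len(row) >= 3}

    # Pass 2: keep the rows of the first file whose key was never seen.
    with open(path1, newline="") as f1:
        reader = csv.reader(f1, skipinitialspace=True)
        next(reader, None)  # assumption: the first line is a header
        return [row for row in reader if len(row) >= 3 and key(row) not in keys2]
```

Calling rows_only_in_first("file1", "file2") returns the unmatched rows as lists of field values, in the order they appear in the first file. With about 200,000 records per file, the set of keys fits comfortably in memory and neither file needs to be sorted.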
Answer 2 (score: 0)
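awk 'BEGIN { FS = "\", \"" }    # split fields on the quote-comma-space-quote separator
FNR == 1 { read++; next }       # header line: count which file we are in, then skip it
read == 1 { store[$1 "," tolower($3)] = $0 }    # file1: remember each row under id + lowercased email
read == 2 { delete store[$1 "," tolower($3)] }  # file2: drop rows whose key also appears here
END { for (i in store) print store[i] }' file1 file2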
Output:
"1233", "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
Answer 3 (score: 0)
Load the contents of both files into an in-memory database such as H2, and use a SQL SELECT with a join.
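As a rough sketch of the in-memory database idea, the example below swaps in SQLite (through Python's standard sqlite3 module) for H2, since it needs no extra installation. The table names f1 and f2, the four-column layout taken from the sample files, and the function names are assumptions made for this illustration.

```python
import csv
import sqlite3


def load(conn, table, path):
    # The table name is supplied by our own code below, never by user input.
    conn.execute(f"CREATE TABLE {table} (id TEXT, name TEXT, email TEXT, country TEXT)")
    with open(path, newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader, None)  # assumption: the first line is a header
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
            (row for row in reader if len(row) == 4),
        )


def anti_join_sqlite(path1, path2):
    """Return rows of path1 with no match in path2 on id and lowercased email."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, "f1", path1)
        load(conn, "f2", path2)
        conn.execute("CREATE INDEX f2_id ON f2 (id)")  # speeds up the join lookup
        # LEFT JOIN keeps every f1 row; unmatched ones have NULL f2 columns.
        return conn.execute(
            "SELECT f1.id, f1.name, f1.email, f1.country "
            "FROM f1 LEFT JOIN f2 "
            "ON f2.id = f1.id AND lower(f2.email) = lower(f1.email) "
            "WHERE f2.id IS NULL "
            "ORDER BY f1.rowid"
        ).fetchall()
    finally:
        conn.close()
```

anti_join_sqlite("file1", "file2") returns the unmatched rows as tuples in file order. The same LEFT JOIN ... WHERE ... IS NULL query should carry over to H2 or another SQL database with little change.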
Answer 4 (score: 0)
awk -F, 'NR == FNR { seen[tolower($1 FS $3)]; next } !(tolower($1 FS $3) in seen)' file2 file1
Output:
"1233", "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"