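Comparing two large unsorted CSV files based on two columns

Time: 2011-08-09 16:41:34

Tags: java python shell csv awk

I have been given the task of comparing two large unsorted .csv files based on columns 1 and 3. Each file contains about 200,000 records. For the output, I need to know which records, keyed on columns 1 and 3, exist in the first file but not in the second. Both files are comma-separated. Column 3 must be compared case-insensitively.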

Sample file 1:

"id", "name", "email", "country"
"1233",  "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"

Sample file 2:

"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada" 

Desired output file:

"5678", "natalia", "natalia@mailinator.com", "USA"

Code samples would be much appreciated.

5 answers:

Answer 0: (score: 1)

Try:

join -v 1 -i -t, -1 1 -2 1 -o 1.2 1.3 1.4 1.5  <(awk -F, '{print $1":"$3","$0}' f1.txt | sort) <(awk -F, '{print $1":"$3","$0}' f2.txt | sort)

How it works:

1) I first create a composite key column by concatenating columns 1 and 3:

awk -F, '{print $1":"$3","$0}' f1.txt
awk -F, '{print $1":"$3","$0}' f2.txt

2) I sort both outputs:

awk -F, '{print $1":"$3","$0}' f1.txt | sort 
awk -F, '{print $1":"$3","$0}' f2.txt | sort 

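3) I then use the join command to join on the first column (my composite key) and, with -v 1, output only the unpairable lines from file 1. The -i option makes the key comparison case-insensitive.

Output: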
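
For readers who prefer Python (one of the question's tags), the three steps above can be sketched roughly as follows. This is not the answer author's code: the function names keyed_sorted and unpaired_in_first are made up for illustration, and the sketch assumes every row has at least three columns and that the first row of each file is a header.

```python
import csv

FILE1 = '''"id", "name", "email", "country"
"1233",  "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
'''

FILE2 = '''"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"
'''


def keyed_sorted(lines):
    """Steps 1 and 2: pair each data row with its composite key, then sort by key."""
    # skipinitialspace drops the blanks that follow each comma in the sample data
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # assumption: the first row is a header
    pairs = [((row[0], row[2].lower()), row) for row in reader if row]
    return sorted(pairs, key=lambda pair: pair[0])


def unpaired_in_first(lines1, lines2):
    """Step 3: walk both sorted lists together, keeping rows of the first with no partner."""
    left, right = keyed_sorted(lines1), keyed_sorted(lines2)
    unpaired, j = [], 0
    for key, row in left:
        # advance in the second list until its key is no longer smaller
        while j < len(right) and right[j][0] < key:
            j += 1
        if j == len(right) or right[j][0] != key:
            unpaired.append(row)
    return unpaired


for row in unpaired_in_first(FILE1.splitlines(), FILE2.splitlines()):
    print(row)
```

On the sample files this keeps the same two rows (ids 1233 and 5678) as the join command. For real files, pass objects opened with open(path, newline="") instead of the in-memory strings.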

"1233",  "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"

Answer 1: (score: 0)

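Loop over the files and load them into two arrays (or, better, hashes) keyed on column 1 joined with the lower-cased column 3. Then loop over the rows of the first file, splitting each line, and output as missing every row whose key is not in the hash built from the second file. I would use Perl for this task.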
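
The answer suggests Perl; as a rough equivalent in Python (one of the question's tags), the hash-based idea might look like the sketch below. The names load_rows, key_of and missing_from_second are hypothetical, and the code assumes a header row and at least three columns per record.

```python
import csv

FIRST = '''"id", "name", "email", "country"
"1233",  "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
'''

SECOND = '''"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"
'''


def load_rows(lines):
    """Parse CSV lines and return the data rows without the header."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # assumption: the first row is a header
    return [row for row in reader if row]


def key_of(row):
    # composite key: column 1 as-is, column 3 lower-cased for case-insensitive matching
    return (row[0], row[2].lower())


def missing_from_second(lines1, lines2):
    """Return rows of the first input whose key never occurs in the second."""
    seen = {key_of(row) for row in load_rows(lines2)}  # hash of keys from file 2
    return [row for row in load_rows(lines1) if key_of(row) not in seen]


for row in missing_from_second(FIRST.splitlines(), SECOND.splitlines()):
    print(row)
```

Only the keys of the second file are held in a set, so membership tests take constant time on average and no sorting is needed; with about 200,000 records per file this should fit comfortably in memory.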

Answer 2: (score: 0)

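awk 'BEGIN { FS = "\", *\"" }    # split on the quote-comma-quote between fields, allowing extra spaces
     FNR == 1 { read++; next }   # a new file starts: count it and skip its header row
     read == 1 { store[$1 "," tolower($3)] = $0 }    # file1: remember each row under its key
     read == 2 { delete store[$1 "," tolower($3)] }  # file2: drop rows whose key also occurs here
     END { for (k in store) print store[k] }' file1 file2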

Output (the order of the rows is not guaranteed):

"1233",  "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"

Answer 3: (score: 0)

Load the contents of both files into an in-memory database such as H2 and use an SQL select with a join.
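
A minimal sketch of this idea is given below. It swaps H2 for SQLite's in-memory mode, because the sqlite3 module ships with Python; the table names f1 and f2, the column names, and the helper names load and only_in_first are assumptions made for illustration, and every row is assumed to have exactly four columns.

```python
import csv
import sqlite3

CSV_ONE = '''"id", "name", "email", "country"
"1233",  "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
'''

CSV_TWO = '''"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"
'''


def load(conn, table, lines):
    """Create a four-column table and fill it from CSV lines, skipping the header."""
    conn.execute(f"CREATE TABLE {table} (id TEXT, name TEXT, email TEXT, country TEXT)")
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # assumption: the first row is a header
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
                     (row for row in reader if row))


def only_in_first(lines1, lines2):
    """Anti-join: rows of f1 that have no f2 row with the same id and email."""
    conn = sqlite3.connect(":memory:")
    load(conn, "f1", lines1)
    load(conn, "f2", lines2)
    rows = conn.execute("""
        SELECT f1.id, f1.name, f1.email, f1.country
        FROM f1
        LEFT JOIN f2
               ON f2.id = f1.id AND lower(f2.email) = lower(f1.email)
        WHERE f2.id IS NULL
        ORDER BY f1.id
    """).fetchall()
    conn.close()
    return rows


for row in only_in_first(CSV_ONE.splitlines(), CSV_TWO.splitlines()):
    print(row)
```

The LEFT JOIN keeps every row of f1, and the WHERE clause then retains only those for which no partner was found in f2. The same query shape should carry over to H2 or another SQL database.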

Answer 4: (score: 0)

awk -F, 'NR==FNR { seen[$1 FS tolower($3)]; next } !(($1 FS tolower($3)) in seen)' file2 file1

Output:

"1233",  "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"