I have two files. The first is file1.csv:
"ACCOUNT_ID","CTN","NAME","GATEWAY_GUID","DEVICE_GUID","CATALOG_ID","FW_VERSION","DATE_CREATED","STATUS_ID","LOCATION_CODE","BAN","Market_Area","State","IMEI","HW_MODEL"
"306875",="9404653975","14-052917 14-052917","313A0B72E3E440DD8687BD681E55FB03","SD0A1B3844",="0100E0102000004","","06/24/2014 19:38:44","0",="0003002008",="177046772949","DLS","TX",="351612051721824",""
and the second is file2.csv:
account,ban,ctn,first_name,last_name,device_gateway_guid,device_id,device_cat_id,IMEI,device_fw_vrsn,date_created,device_status,subscription_created,subscription_name,subscription_market,date
DL!813269 , 418069632891 , undefined , MUHAMMAD , ANJUM , 313A0B72E3E440DD8687BD681E55FB03, ACFF010904 , 00010907000004 , 351612054025777 , , 2015-12-18 19:45:31 , 0 , undefined , [object Object] , WAS , undefined
I want to concatenate columns 4 and 5 of file1 to get
313A0B72E3E440DD8687BD681E55FB03SD0A1B3844
and the corresponding columns of file2 (6 and 7) to get
313A0B72E3E440DD8687BD681E55FB03ACFF010904
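Then I want to compare the concatenated strings from file1 with the concatenated strings from file2. The output should be every record of file1 that does not appear in file2.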
Output for the example:
313A0B72E3E440DD8687BD681E55FB03SD0A1B3844
because it is in file1 but not in file2. I only care about records that are in file1 and not in file2.
This is what I tried:
awk -F'[ "]*,[ "]*' 'NR==FNR{a[$6$7];next} (FNR==1) || !($4$5 in a)' file2.csv file1.csv
But this produces only about 15,000 records, and I expected about 160,000.
Answer 0 (score: 0)
Using Python (untested):
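import csv

# Collect the (gateway GUID, device ID) keys from file2 (columns 6 and 7).
with open('file2.csv', newline='') as f:
    keys2 = {(row[5].strip(), row[6].strip())
             for row in csv.reader(f) if len(row) > 6}

# Print every file1 record whose key (columns 4 and 5) is missing from file2.
with open('file1.csv', newline='') as f:
    for row in csv.reader(f):
        if len(row) > 4 and (row[3].strip(), row[4].strip()) not in keys2:
            print(row)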
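To check the approach on the two sample rows from the question without touching the real files, the comparison can be sketched as a small function that works on any iterable of lines. The names `clean`, `keys`, and `only_in_file1` are hypothetical, and the cleanup rules (trimming spaces, stray quotes, and a leading `=`) are assumptions based on the sample data, not something the real files are known to need.

```python
import csv
import io


def clean(field):
    """Trim spaces, stray quotes, and a leading '=' (assumed cleanup rules)."""
    return field.strip().lstrip('=').strip('"').strip()


def keys(lines, col_a, col_b):
    """Yield the concatenation of two 0-based columns for every data row."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) > max(col_a, col_b):
            yield clean(row[col_a]) + clean(row[col_b])


def only_in_file1(file1_lines, file2_lines):
    """Return file1 keys (columns 4+5) with no matching file2 key (columns 6+7)."""
    seen = set(keys(file2_lines, 5, 6))
    return [k for k in keys(file1_lines, 3, 4) if k not in seen]


# Demo on shortened copies of the sample rows, held in memory instead of files.
file1 = io.StringIO(
    '"ACCOUNT_ID","CTN","NAME","GATEWAY_GUID","DEVICE_GUID"\n'
    '"306875",="9404653975","14-052917 14-052917",'
    '"313A0B72E3E440DD8687BD681E55FB03","SD0A1B3844"\n'
)
file2 = io.StringIO(
    'account,ban,ctn,first_name,last_name,device_gateway_guid,device_id\n'
    'DL!813269 , 418069632891 , undefined , MUHAMMAD , ANJUM , '
    '313A0B72E3E440DD8687BD681E55FB03, ACFF010904\n'
)
result = only_in_file1(file1, file2)
print(result)  # ['313A0B72E3E440DD8687BD681E55FB03SD0A1B3844']
```

With the real data, the same function should accept open file objects, for example `only_in_file1(open('file1.csv', newline=''), open('file2.csv', newline=''))`. Using the csv module rather than a regular-expression field separator means a comma inside a quoted field (such as NAME) does not shift the column positions.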