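I have a file from which I need to remove some URLs. The URLs are listed in fileA, and the data is in a CSV file, fileB (both are huge, 6-10 GB each). I have tried the following grep command, but it does not work on the updated fileB.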
grep -vwF -f patterns.txt fileB.csv > result.csv
fileA is a single-column list of URLs, like this:
URLs (header, single column)
bwin.hu
paradisepoker.li
And fileB:
type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com
2|||www.bwin.hu|||1524024324|||bwin.hu
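The delimiter in fileB is |||
I am open to any solution, including awk. Thanks.
Edit: the expected output is a CSV file that keeps every row whose domain does not match any entry in fileA: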
type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com
Answer 0 (score: 1)
Could you please try the following.
awk 'FNR==NR{a[$0];next} !($NF in a)' Input_filea FS="\\|\\|\\|" Input_fileb
OR
awk 'FNR==NR{a[$0];next} !($NF in a)' filea FS='\|\|\|' fileb
The output will be as follows.
type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com
Explanation: the same command, annotated line by line.
awk ' ## Start of the awk program.
FNR==NR{ ## FNR==NR is TRUE only while the first input file (filea) is being read.
a[$0] ## Store the current line ($0) as a key in the array a.
next ## Skip the remaining rules and move on to the next input line.
} ## End of the FNR==NR block.
!($NF in a) ## Reached for fileb only: TRUE when the last field of the line is NOT a key in a.
## No action is given, so when the condition is TRUE awk prints the line by default.
' filea FS="\\|\\|\\|" fileb ## Input file names; FS is assigned just before fileb is read, with each | escaped so that awk treats ||| as literal characters.
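Quoting backslashes inside an FS assignment is easy to get wrong, so here is a minimal sketch of the same idea that writes each pipe as a bracket expression, `[|]`, which needs no escaping. It is only an illustration on the sample rows from the question. The temp directory and the names `fileA`, `fileB.csv` and `result.csv` are assumptions, and it assumes the last field of fileB holds the bare domain exactly as it is listed in fileA.

```shell
# Sample data copied from the question; the temp dir and file names are assumptions.
tmp=$(mktemp -d)
printf '%s\n' 'URLs' 'bwin.hu' 'paradisepoker.li' > "$tmp/fileA"
printf '%s\n' \
  'type|||URL|||Date|||Domain' \
  '1|||https://www.google.com|||1524024000|||google.com' \
  '2|||www.bwin.hu|||1524024324|||bwin.hu' > "$tmp/fileB.csv"

# [|][|][|] is a regex for three literal pipes, with no backslashes involved.
# First file:  remember every line as a key in the array "seen".
# Second file: print only rows whose last field (Domain) is not a key in "seen".
awk -F'[|][|][|]' 'NR==FNR { seen[$0]; next } !($NF in seen)' \
  "$tmp/fileA" "$tmp/fileB.csv" > "$tmp/result.csv"

cat "$tmp/result.csv"
```

Setting the separator with `-F` applies it to both files, which is harmless here because fileA is matched on the whole line (`$0`). The header line of fileA (`URLs`) is stored as a key as well, so it only matters if some row in fileB has that exact text as its last field.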