Filter out unrecognized fields using awk

Date: 2019-01-03 16:45:42

Tags: linux awk

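I have a CSV file in which I expect certain values, such as Y or N. People are adding comments or arbitrary entries that I want to remove, such as NA?.

I can use gsub to remove the entries I anticipate, for example in input like this: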

Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,

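However, that will break as soon as someone adds a new comment. I am looking for a regex that generalizes the match to "not Y".

I tried some negative lookarounds but could not get them to work with gsub in the awk that I have. Thanks in advance!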
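
For illustration, the enumerate-and-delete approach with gsub might look like the sketch below. The exact patterns and the helper name strip_known are assumptions made for this example, not the asker's actual command. It also shows why the approach is brittle: any comment that is not on the list survives. (awk regexes are POSIX extended regular expressions, which have no lookaround syntax, so negative lookarounds are not an option here.)

```shell
# Hypothetical helper: delete only the comments we already know about.
strip_known() {
  awk '{ gsub(/NA ?\?|TBD/, ""); print }' "$@"
}

# "N/A" is not in the list of known comments, so it slips through.
printf '%s\n' 'Create,1,Y,NA?,TBD,N/A,Y' | strip_known
# prints: Create,1,Y,,,N/A,Y
```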

4 answers:

Answer 0 (score: 6)

awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

This accepts only Y or N, in either case.
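
An equivalent way to get the case-insensitive behaviour is to fold each field to lower case before testing it. This is only a sketch of that variant; the function name keep_yn and the sample line are made up for the example.

```shell
# Hypothetical wrapper: keep only y/Y/n/N from the third field onwards.
keep_yn() {
  awk 'BEGIN { FS = OFS = "," }
       {
         for (i = 3; i <= NF; i++)
           if (tolower($i) !~ /^[yn]$/) $i = ""   # blank anything else
         print
       }' "$@"
}

printf '%s\n' 'Create,1,y,N,maybe,Y,no' | keep_yn
# prints: Create,1,y,N,,Y,
```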

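Answer 1 (score: 2)

awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'

This seems to do the trick. It loops from the third field to the last one and replaces the field with nothing if it is not exactly Y. Because we are modifying fields, awk rebuilds the record, so we also need to set OFS.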

$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,

$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' file.txt
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

If you also want to accept N, /^[YN]$/ works as well.
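
The remark about OFS can be checked on a tiny input. Assigning to a field makes awk rebuild the record using OFS, which defaults to a single space. The one-line sample below is invented for the demonstration.

```shell
# Without OFS, the rebuilt record is joined with single spaces.
printf '%s\n' 'a,NA?,c' | awk 'BEGIN { FS = "," } { $2 = ""; print }'
# prints: a  c

# With OFS set to a comma as well, the commas survive the rebuild.
printf '%s\n' 'a,NA?,c' | awk 'BEGIN { FS = OFS = "," } { $2 = ""; print }'
# prints: a,,c
```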

Answer 2 (score: 1)

awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}' test.CSV

Output:

Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

Update: so, if you only want to check whether a field is exactly "Y", there is no need for a regex.
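
The plain string comparison can be extended to several allowed values without a regex by looking each field up in an array. The following is a minimal sketch of that idea; the helper keep_allowed and its comma-separated first argument are assumptions made for this example.

```shell
# Hypothetical helper: usage is  keep_allowed "Y,N" [file...]
keep_allowed() {
  allowed=$1; shift
  awk -v allowed="$allowed" '
    BEGIN {
      FS = OFS = ","
      n = split(allowed, vals, ",")
      for (k = 1; k <= n; k++) ok[vals[k]] = 1   # build the lookup table
    }
    {
      for (i = 3; i <= NF; i++)
        if (!($i in ok)) $i = ""                 # exact string match only
      print
    }' "$@"
}

printf '%s\n' 'Create,1,Y,N,NA?,TBD,Y' | keep_allowed 'Y,N'
# prints: Create,1,Y,N,,,Y
```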

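However, if you do want to use a regex: zzevannn's answer and tink's answer have already given good ideas for a regex condition, so I will instead use a regex to do a batch replacement.

To be precise, and to make it more challenging, I created some boundary conditions: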

$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,

The batch replacement is:

$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
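Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

"(,)[^,]*[^Y,]+[^,]*" matches a comma followed by a field that contains at least one character other than Y. The captured comma is put back by \\1, so the field is emptied while its separator stays. Note that gensub is a GNU awk (gawk) extension.
Note that I saved $1 and removed $1 and the comma after it first, then printed it back. This leaves the second field at the start of the record with no comma in front of it, so the regex cannot blank it.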
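
Because the captured group is always a literal comma, the same batch replacement can be sketched with the standard gsub for awks that lack gensub. This is an assumed portable variant, not the answer's own command. The function name strip_non_y is made up, and the sketch assumes every line has at least two fields.

```shell
# Hypothetical portable variant of the gensub one-liner (no gawk needed).
strip_non_y() {
  awk 'BEGIN { FS = "," }
       {
         first = $1
         rest = $0
         sub(/^[^,]*,/, "", rest)              # drop the first field and its comma
         gsub(/,[^,]*[^Y,][^,]*/, ",", rest)   # empty fields holding a non-Y character
         print first "," rest
       }' "$@"
}

printf '%s\n' 'Create,20055776,Y,,YNA?,NANN,Y,TYBD,' | strip_non_y
# prints: Create,20055776,Y,,,,Y,,
```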

Answer 3 (score: 0)

A sed solution

# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;t a' test.csv

# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;ta' test.csv

The same idea in awk (which avoids some of the problems caused by sed's lack of regex alternation):

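awk -F ',' '{ Idx=$2;gsub(/,[[:blank:]]*[^YN,][^,]*/, ",");sub( /,/, "," Idx);print}'

Note that, because of the leading [^YN,], this only empties fields whose first character is neither Y nor N, so an entry such as NA? is left in place.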