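I am trying to remove a few columns from a file and then keep only its unique lines. The columns I want to remove are the month, day, time and epoch time; they differ on every line and prevent me from deduplicating the file contents.
Sample contents of sample.log: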
Jun 5 05:13:13 AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
Jun 5 05:13:14 AAA AAA AAAA 1433495594.306612 XXXX CCCC CCCC AAAA SDDDD DFFFFF222
Jun 5 05:13:13 AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
Jun 5 05:13:15 AAA AAA AAAA XXXXX 1433495596.306614 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
Jun 5 05:13:16 AAA AAA AAAA XXXXX 1433495597.306615 XXXX CCCC CCCC AAAA SDDDD DFFFFF333
Jun 5 05:13:17 AAA AAA AAAA XXXXX 1433495598.306616 XXXX CCCC CCCC AAAA SDDDD DFFFFF444
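Problem:
The month, day and time are in fixed columns, but the epoch time switches between the 7th and 8th column. I would like to know how to handle this.
Sample output: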
Jun 5 05:13:13 AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
Jun 5 05:13:13 AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
Jun 5 05:13:15 AAA AAA AAAA XXXXX 1433495596.306614 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
If the above is too much, then something like the following:
AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
AAA AAA AAAA XXXXX 1433495596.306614 XXXX CCCC CCCC AAAA SDDDD DFFFFF111
I have been trying along the following lines, but it has not helped much.
while read line
do
seven=$(echo $line |awk '{print $7}')
eight=$(echo $line |awk '{print $8}')
if [[ "$seven" =~ "^[0-9]" ]];then
#echo "seventh column starts with number"
echo $line|awk '$1=$2=$3=$7=" " {print}'
else
#echo "Eighth column starts with number"
echo $line|awk '$1=$2=$3=$8=" " {print}'
fi
done < $1
More examples:
Input file contents:
Jun 5 05:13:13 AAA BBB CCC 142222222222.000 DDD EEE FFFF
Jun 5 05:13:13 AAA BBB CCC 142222222223.000 DDD EEE FFFF
Jun 5 05:13:14 AAA BBB CCC 142222222224.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE FFFF
Jun 5 05:13:13 AAA BBB CCC XXX 142222222226.000 DDD EEE FFFF
Output:
Jun 5 05:13:13 AAA BBB CCC 142222222223.000 DDD EEE FFFF
Jun 5 05:13:13 AAA BBB CCC 142222222223.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE FFFF
OR
Output:
AAA BBB CCC DDD EEE FFFF
AAA BBB CCC DDD EEE GGGG
AAA BBB CCC XXX DDD EEE GGGG
AAA BBB CCC XXX DDD EEE FFFF
Answer 0 (score: 2)
A very basic approach is to check the format of the field: if it consists of digits + . + digits, that's the one!
awk '{$1=$2=$3=""
if ($7 ~ /^[0-9]+\.[0-9]+$/) {$7=""}
else {$8=""}
} 1' file
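Note that this leaves some extra spaces, because when you empty a field the surrounding FS are still there. To clean up the columns, check Ed Morton's answer to Print all but the first three columns.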
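One possible way to get rid of the leftover spaces, as a minimal sketch rather than the approach from the linked answer: assigning $0 to itself makes awk re-split the record, so the emptied fields disappear, and assigning $1 to itself rebuilds the record with single OFS separators. The function name strip_and_squeeze is made up for illustration, and the sketch assumes single-space-separated output is acceptable.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
strip_and_squeeze() {
  awk '{
    $1 = $2 = $3 = ""                                     # blank month, day, time
    if ($7 ~ /^[0-9]+\.[0-9]+$/) { $7 = "" } else { $8 = "" }  # blank the epoch
    $0 = $0      # re-split the record: the emptied fields vanish
    $1 = $1      # rebuild the record with single-space separators
    print
  }' "$@"
}

printf '%s\n' \
  'Jun 5 05:13:13 AAA AAA AAAA 1433495593.306611 XXXX CCCC CCCC AAAA SDDDD DFFFFF111' |
  strip_and_squeeze
```

It can also be called with a file name, for example strip_and_squeeze sample.log.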
To make sure that lines are not repeated with respect to the 1st, 2nd, 3rd and last columns, use the awk '!uniq[$0]++' file approach:
awk '!uniq[$1 $2 $3 $(NF-4) $(NF-2) $(NF-1) $NF]++{$1=$2=$3=""
if ($7 ~ /^[0-9]+\.[0-9]+$/) {$7=""}
else {$8=""}
} 1' file
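Note that the key above still contains $1, $2 and $3, so lines that differ only in their timestamp count as different. If the goal is the timestamp-free output shown in the question, one alternative (a sketch under that assumption, not the answer's original code) is to blank the fields first and then deduplicate on the rebuilt record. The function name dedupe_without_timestamps and the array name seen are made up for illustration.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
dedupe_without_timestamps() {
  awk '{
    $1 = $2 = $3 = ""                                     # blank month, day, time
    if ($7 ~ /^[0-9]+\.[0-9]+$/) { $7 = "" } else { $8 = "" }  # blank the epoch
    $0 = $0; $1 = $1          # drop the emptied fields, single-space the rest
    if (!seen[$0]++) print    # print only the first occurrence of each cleaned line
  }' "$@"
}

printf '%s\n' \
  'Jun 5 05:13:13 AAA BBB CCC 142222222222.000 DDD EEE FFFF' \
  'Jun 5 05:13:13 AAA BBB CCC 142222222223.000 DDD EEE FFFF' |
  dedupe_without_timestamps
```

Unlike sort -u, this keeps the lines in the order of their first appearance.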
Answer 1 (score: 2)
If I understand the question correctly, there is no need for Bash here, just awk:
% awk '
{
for (f = 4; f <= NF; ++f) { # Start at column 4
if (f == 7 || f == 8) { # Treat columns 7 or 8 differently
if ($f !~ /^[0-9]+\.[0-9]+$/) { # Only print if non-numeric
printf $f " "
}
} else {
printf $f " "
}
}
printf "\n"
}
' sample.log
AAA AAA AAAA XXXX CCCC CCCC AAAA SDDDD DFFFFF111
AAA AAA AAAA XXXX CCCC CCCC AAAA SDDDD DFFFFF222
AAA AAA AAAA XXXX CCCC CCCC AAAA SDDDD DFFFFF111
AAA AAA AAAA XXXXX XXXX CCCC CCCC AAAA SDDDD DFFFFF111
AAA AAA AAAA XXXXX XXXX CCCC CCCC AAAA SDDDD DFFFFF333
AAA AAA AAAA XXXXX XXXX CCCC CCCC AAAA SDDDD DFFFFF444
To get the unique lines:
% awk '
{
for (f = 4; f <= NF; ++f) { # Start at column 4
if (f == 7 || f == 8) { # Treat columns 7 or 8 differently
if ($f !~ /^[0-9]+\.[0-9]+$/) { # Only print if non-numeric
printf $f " "
}
} else {
printf $f " "
}
}
printf "\n"
}
' sample2.log | sort -u
AAA BBB CCC DDD EEE FFFF
AAA BBB CCC DDD EEE GGGG
AAA BBB CCC XXX DDD EEE FFFF
AAA BBB CCC XXX DDD EEE GGGG
... If your input file contains % symbols, as mentioned in your comment, you need to escape them for printf first. You can do that with a function like this...
% awk '
function escape_percents(s)
{
gsub("%", "%%", s)
return s
}
{
for (f = 4; f <= NF; ++f) { # Start at column 4
if (f == 7 || f == 8) { # Treat columns 7 or 8 differently
if ($f !~ /^[0-9]+\.[0-9]+$/) { # Only print if non-numeric
printf escape_percents($f) " "
}
} else {
printf escape_percents($f) " "
}
}
printf "\n"
}
' sample2.log | sort -u
AAA BBB CCC DDD %E%E%E FFFF
AAA BBB CCC DDD %E%E%E GGGG
AAA BBB CCC XXX DDD %E%E%E FFFF
AAA BBB CCC XXX DDD %E%E%E GGGG
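An alternative to escaping, offered as a sketch rather than as part of the original answer: pass the field to printf as an argument to a fixed "%s" format instead of using the field itself as the format string, so any % in the data is printed literally. The function name strip_timestamps and the variable sep are made up for illustration. Unlike the code above, this version also avoids the trailing space.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
strip_timestamps() {
  awk '{
    sep = ""
    for (f = 4; f <= NF; ++f) {                 # start at column 4
      # skip column 7 or 8 when it looks like an epoch time
      if ((f == 7 || f == 8) && $f ~ /^[0-9]+\.[0-9]+$/) continue
      printf "%s%s", sep, $f                    # data is an argument, never the format
      sep = " "
    }
    printf "\n"
  }' "$@"
}

printf '%s\n' \
  'Jun 5 05:13:13 AAA BBB CCC 142222222222.000 DDD %E%E%E FFFF' |
  strip_timestamps
```

As before, the result can be piped through sort -u to keep only the unique lines.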
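Answer 2 (score: 0)
If the columns after the epoch time stay constant, the simplest approach is to work relative to NF.
Using the input from the "More examples" section: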
awk '{NewLine=$4;
for(i=(NF-5);i>=0;i--){
if(i!=3){
NewLine=NewLine" "$(NF-i)
}
}
print NewLine
}' Sample.log | sort | uniq
With the input
Jun 5 05:13:13 AAA BBB CCC 142222222222.000 DDD EEE FFFF
Jun 5 05:13:13 AAA BBB CCC 142222222223.000 DDD EEE FFFF
Jun 5 05:13:14 AAA BBB CCC 142222222224.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE GGGG
Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE FFFF
Jun 5 05:13:13 AAA BBB CCC XXX 142222222226.000 DDD EEE FFFF
you get
AAA BBB CCC DDD EEE FFFF
AAA BBB CCC DDD EEE GGGG
AAA BBB CCC XXX DDD EEE FFFF
AAA BBB CCC XXX DDD EEE GGGG
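The same idea can be sketched more compactly by blanking fields instead of rebuilding the line in a loop. This assumes, as in the input above, that exactly three fields follow the epoch time, so the epoch is always field NF-3. The function name strip_by_position is made up for illustration.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
strip_by_position() {
  awk '{
    $1 = $2 = $3 = ""     # blank month, day, time
    $(NF-3) = ""          # blank the epoch: assumed to be 4th field from the end
    $0 = $0; $1 = $1      # drop the emptied fields, single-space the rest
    print
  }' "$@" | sort -u
}

printf '%s\n' \
  'Jun 5 05:13:13 AAA BBB CCC XXX 142222222225.000 DDD EEE GGGG' \
  'Jun 5 05:13:13 AAA BBB CCC 142222222222.000 DDD EEE FFFF' |
  strip_by_position
```

If the number of trailing fields varies, this positional assumption breaks, and matching on the digits + . + digits format is the safer choice.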