AWK: Preserve field spacing as in the input file

Time: 2015-03-23 13:27:38

Tags: bash perl awk

I have mocked up my problem in the test file below:

# cat out 
2014-01-10 18:23:25          0 Andy/ADPTER/
2014-01-10 18:23:36        503 Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 Jim/ADPTER/UNITS MAP.csv

Here is my Bash variable:

# echo $bucket
bucket_name

So, in the file above, I want the value of the Bash variable prepended to the 4th field.

Here is my desired output:

2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv

This is what I tried:

# awk -v var=$bucket '{$4=var"/"$4; print}' out 
2014-01-10 18:23:25 0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36 503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38 516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38 398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38 11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38 260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39 466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40 373 bucket_name/Jim/ADPTER/UNITS MAP.csv

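Problem:

My awk command does what I need; however, it messes up the field spacing (the delimiters??). My intention is to prefix bucket_name/ to the 4th field and keep whatever spacing scheme the input file has (including right/left-aligned fields).

Here is my other attempt: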

# awk -v var=$bucket 'BEGIN{OFS="\t"}{$4=var"/"$4; print}' out 
2014-01-10  18:23:25    0   bucket_name/Andy/ADPTER/
2014-01-10  18:23:36    503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE    MAP.csv
2014-01-10  18:23:38    516 bucket_name/John/ADPTER/CITY    MAP.csv
2014-01-10  18:23:38    398 bucket_name/Wendy/ADPTER/COUNTRY    MAP.csv
2014-01-10  18:23:38    11117   bucket_name/Andy/ADPTER/CURRENCY    MAP.csv
2014-01-10  18:23:38    260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10  18:23:39    466 bucket_name/John/ADPTER/STATE   MAP.csv
2014-01-10  18:23:40    373 bucket_name/Jim/ADPTER/UNITS    MAP.csv

But that did not help either.

Thanks.

5 answers:

Answer 0: (score: 3)

You tagged Perl in the OP, so here is a Perl solution:

perl -pe'BEGIN{$var=shift}s,(?:.*?\s+){3}\K,$var/,' "$bucket" out

Technically it is the same solution as the one using sed, but it has the benefit of avoiding escaping issues: the shell variable $bucket can contain anything.

Answer 1: (score: 2)

You can use sed.

$ bucket='bucket_name'
$ sed "s~^\(\([^[:blank:]]\+[[:blank:]]\+\)\{3\}\)~\1$bucket/~" file
2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv

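[[:blank:]]\+ uses a POSIX character class and matches one or more horizontal whitespace characters (spaces or tabs). [^[:blank:]]\+ uses the negated class and matches one or more characters that are not horizontal whitespace.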
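Note that \+ in a basic regular expression is a GNU extension. Where portability matters, the same pattern can be written with the POSIX interval \{1,\}. The following is a sketch of that spelling applied to one sample line; the variables prefix and portable are only there for the demonstration.

```shell
bucket='bucket_name'
prefix='2014-01-10 18:23:38        398 '

# Same substitution as above, with \{1,\} instead of the GNU-only \+
portable=$(printf '%s\n' "${prefix}Wendy/ADPTER/COUNTRY MAP.csv" |
  sed "s~^\(\([^[:blank:]]\{1,\}[[:blank:]]\{1,\}\)\{3\}\)~\1$bucket/~")

printf '%s\n' "$portable"
```

As with the original command, this is only safe when $bucket contains no characters that sed treats specially in a replacement (&, backslash, or the ~ delimiter).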

Answer 2: (score: 2)

You can use this awk:

bucket="bucket_name"
awk --re-interval -v b="$bucket" '{sub(/([^[:blank:]]+[[:blank:]]+){3}/, 
     "&" b "/")} 1' file
2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv


-v b="$bucket"                 # pass a value to awk in variable b
--re-interval                  # Enable the use of interval
                               # expressions in regular expression matching
sub                            # match input using regex and substitute with
                               # the given string
([^[:blank:]]+[[:blank:]]+){3} # match first 3 fields of the line separated by space/tab
 "&" b "/"                     # replace by matched string + var b + /
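Why this keeps the spacing while the attempt in the question does not: assigning to a field makes awk rebuild $0 by joining the fields with OFS, whereas sub() edits the text of $0 directly. The comparison below is a sketch on one sample line; the interval {3} is written out longhand so it also runs on awk versions without interval support, and the variable names rebuilt and kept are illustrative.

```shell
prefix='2014-01-10 18:23:38      11117 '
line="${prefix}Andy/ADPTER/CURRENCY MAP.csv"

# Field assignment: $0 is rebuilt with OFS (a single space by default),
# so the run of spaces before 11117 collapses.
rebuilt=$(printf '%s\n' "$line" | awk '{ $4 = "bucket_name/" $4; print }')

# sub() on $0: no field is assigned, so the record text keeps its spacing.
kept=$(printf '%s\n' "$line" |
  awk '{ sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "&bucket_name/"); print }')

printf '%s\n' "$rebuilt" "$kept"
```

The first printed line has single spaces between all fields; the second keeps the original column alignment.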

EDIT: (thanks to @EdMorton) To make it work for any value in the argument (for example, try both solutions with bucket="&"), use:

awk --re-interval -v b="$bucket" 'match($0, /([^[:blank:]]+[[:blank:]]+){3}/) {
    $0 = substr($0, 1, RLENGTH) b "/" substr($0, RLENGTH+1) } 1' file
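The difference between the two variants shows up as soon as the value contains an ampersand, because & in the replacement string of sub() stands for the matched text. The sketch below runs both on one sample line with bucket set to "&"; the regex is written without the interval for portability, and with_sub, with_substr, and prefix are names invented for the example.

```shell
bucket='&'
prefix='2014-01-10 18:23:25          0 '
line="${prefix}Andy/ADPTER/"

# sub(): the "&" coming from b is expanded to the matched text as well,
# so the first three fields appear twice.
with_sub=$(printf '%s\n' "$line" |
  awk -v b="$bucket" '{ sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "&" b "/") } 1')

# match() + substr(): b is concatenated as plain text.
with_substr=$(printf '%s\n' "$line" |
  awk -v b="$bucket" 'match($0, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/) {
      $0 = substr($0, 1, RLENGTH) b "/" substr($0, RLENGTH + 1) } 1')

printf '%s\n' "$with_sub" "$with_substr"
```

Keep in mind that awk still processes backslash escape sequences in values passed with -v, so a value containing backslashes needs separate care.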

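Answer 3: (score: 1)

This is a bit tricky in awk, but there is a relevant GNU extension: in gawk, the split function takes an optional fourth argument that stores the actual field separators for later use. Using it: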

gawk -v bucket="$bucket" '{ split($0, f, FS, d); d[NF] = ORS; f[4] = bucket "/" f[4]; for(i = 1; i <= NF; ++i) printf("%s%s", f[i], d[i]); }' filename

That is:

{
  split($0, f, FS, d)             # split line into fields, saving fields in
                                  # the f and delimiters in the d array
  d[NF] = ORS                     # for the newline at the end
  f[4] = bucket "/" f[4]          # fix fourth field
  for(i = 1; i <= NF; ++i) {      # then print the fields separated by the
    printf("%s%s", f[i], d[i]);   # saved delimiters
  }
}
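If gawk is not available, a similar effect can be approximated in POSIX awk by walking over the first three fields with match() and substr(), copying each field together with the separator that follows it. This is a sketch of a swapped-in technique rather than part of the gawk solution above; the names rest, out, prefix, and result are illustrative, and lines with fewer than four fields are left unchanged by design.

```shell
bucket='bucket_name'
prefix='2014-01-10 18:23:39        466 '

result=$(printf '%s\n' "${prefix}John/ADPTER/STATE MAP.csv" |
  awk -v bucket="$bucket" '{
      rest = $0; out = ""
      # Copy up to three "field + following blanks" chunks unchanged
      for (i = 1; i <= 3 && match(rest, /[^ \t]+[ \t]+/); i++) {
          out  = out substr(rest, 1, RSTART + RLENGTH - 1)
          rest = substr(rest, RSTART + RLENGTH)
      }
      # i > 3 only if all three chunks were found, so rest starts at field 4
      if (i > 3) rest = bucket "/" rest
      print out rest
  }')

printf '%s\n' "$result"
```

Unlike the gawk version, this leaves everything from the fourth field onward untouched, including any spacing inside the file name.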

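Addendum: Unless the variable comes from a trusted source and is guaranteed not to contain metacharacters, I can't really recommend doing this with sed (otherwise you have a code injection problem). That said, a simple way with sed is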

sed "s|[[:space:]]\+|&${bucket}/|3" filename

...which appends ${bucket}/ after the third occurrence of [[:space:]]\+.
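One way to reduce the injection risk, sketched here under the assumption that the value contains no newline, is to backslash-escape the characters that are special in the replacement part (backslash, &, and the | delimiter) before handing the value to sed. The sample value a&b|c and the variable names esc, prefix, and safe are made up for the demonstration.

```shell
bucket='a&b|c'   # hypothetical value; unescaped, it would break the s command
prefix='2014-01-10 18:23:38        260 '

# Put a backslash in front of every \, & and | in the value
esc=$(printf '%s\n' "$bucket" | sed 's/[\\&|]/\\&/g')

safe=$(printf '%s\n' "${prefix}Sandy/ADPTER/GENDER MAP.csv" |
  sed "s|[[:space:]]\+|&${esc}/|3")

printf '%s\n' "$safe"
```

This covers only the replacement side of the s command; it is a mitigation sketch, not a general-purpose quoting routine.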

Answer 4: (score: 1)

If you want to stick with awk, then providing an explicit format string is probably simplest:

awk '{printf "%s %s %10s %s/%s\n", $1, $2, $3, b, $4}' b="$bucket" out
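One caveat with the command above: $4 holds only the first whitespace-separated word of the path, so a name such as "Sandy/ADPTER/ACCOUNTTYPE MAP.csv" would lose " MAP.csv". A possible variant, sketched below, takes everything after the third field as the name instead. It still assumes the layout seen in the sample (single spaces after date and time, size right-aligned in a 10-character column); the variables name, line, and fixed are illustrative.

```shell
bucket='bucket_name'
# Build a sample line with the size right-aligned in a 10-character column
line=$(printf '%s %s %10s %s' 2014-01-10 18:23:36 503 'Sandy/ADPTER/ACCOUNTTYPE MAP.csv')

fixed=$(printf '%s\n' "$line" |
  awk -v b="$bucket" '{
      name = $0
      # Strip the first three fields and their separators; the rest is the full name
      sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", name)
      printf "%s %s %10s %s/%s\n", $1, $2, $3, b, name
  }')

printf '%s\n' "$fixed"
```

Because b is passed to printf as an argument rather than used inside a replacement string, an ampersand in the value is printed literally.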