Aggregating a CSV file in a bash script

Time: 2017-11-26 09:23:24

Tags: bash awk sed

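I have a CSV file with many rows. Every row has the same number of columns. I need to group the rows by a few specified columns and aggregate the data from the other columns. Example input file: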

proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2

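For the example above, I need to group the rows by the first two columns. From column 3 I need the minimum (earliest date), from column 4 the maximum (latest date), and column 5 should hold the sum. So for this input file I need the following output: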

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

I need to do this in bash (I can also use awk or sed).

1 answer:

Answer 0 (score: 1)

Using bash and sort:

#!/bin/bash

# create associative arrays 
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de  # date start and date end
declare -A -i sum   # set integer attribute 

# function to convert 5-Jun-2011 to 20110605
date2num() {
  local d m y
  IFS="-" read -r d m y <<< "$1"
  # 10#$d forces base 10 so a zero-padded day such as 08 is not parsed as octal
  printf "%d%.2d%.2d\n" "$y" "${month2num[$m]}" "$((10#$d))"
}

# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do

  # if associative array is still empty for this entry
  # fill with current strings/value
  if [[ -z ${p[$p1,$p2]} ]]; then
    p[$p1,$p2]="$p1,$p2"
    ds[$p1,$p2]="$d1"
    de[$p1,$p2]="$d2"
    sum[$p1,$p2]="$s"
    continue
  fi

  # compare strings, set new strings and sum value
  if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
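    # < and > inside [[ ]] compare strings; that is safe here because
    # date2num always returns eight digits (YYYYMMDD)
    [[ $(date2num "$d1") < $(date2num "${ds[$p1,$p2]}") ]] && ds[$p1,$p2]="$d1"
    [[ $(date2num "$d2") > $(date2num "${de[$p1,$p2]}") ]] && de[$p1,$p2]="$d2"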
    sum[$p1,$p2]=sum[$p1,$p2]+s
  fi

done < file

# print content of all associative arrays, using the keys of associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done

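Usage (the script reads its input from a file named file in the current directory; sort is needed because bash returns associative array keys in no particular order): ./script.sh | sort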

Output to stdout:

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
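
The min/max logic in the script depends on date2num turning each date into a fixed-width YYYYMMDD number, so that ordering dates reduces to comparing numbers. Here is a minimal sketch of that step on its own, assuming bash 4 or later for associative arrays. The helper name earlier is hypothetical and not part of the answer, and the 10# prefix is an added guard for zero-padded days.

```shell
#!/bin/bash

# month abbreviation -> month number
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6
                      [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)

# convert D-Mon-YYYY (e.g. 5-Jun-2011) to YYYYMMDD (e.g. 20110605)
date2num() {
  local d m y
  IFS="-" read -r d m y <<< "$1"
  # 10#$d forces base 10 so a day written as 08 is not parsed as octal
  printf "%d%.2d%.2d\n" "$y" "${month2num[$m]}" "$((10#$d))"
}

# hypothetical helper: succeeds if the first date is before the second
earlier() {
  (( $(date2num "$1") < $(date2num "$2") ))
}

date2num 5-May-2011    # prints 20110505
date2num 15-Oct-2017   # prints 20171015

if earlier 10-Sep-2017 15-Oct-2017; then
  echo "10-Sep-2017 is earlier than 15-Oct-2017"
fi
```

Comparing the original strings directly would not work, because "5-May-2011" and "15-Oct-2017" sort by their leading day digits rather than by year.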

See: help declare, help read, and of course man bash
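
Since the question also allows awk, the same grouping could be sketched there. This is an alternative to the bash answer above, not a restatement of it. The wrapper name aggregate, the awk function num, and the array names first, last and total are all assumptions chosen for illustration. The sketch reads CSV from stdin and expects dates in the D-Mon-YYYY form shown in the question.

```shell
#!/bin/sh

# aggregate (hypothetical name): group by columns 1-2, keep the earliest
# column 3, the latest column 4, and the sum of column 5
aggregate() {
  awk -F, -v OFS=, '
    BEGIN {
      # map month abbreviations to the numbers 1..12
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) month[names[i]] = i
    }
    # num: convert D-Mon-YYYY to the comparable number YYYYMMDD
    function num(date,   f) {
      split(date, f, "-")
      return f[3] * 10000 + month[f[2]] * 100 + f[1]
    }
    {
      key = $1 OFS $2
      # a key not yet in total means this is the first row of its group
      if (!(key in total) || num($3) < num(first[key])) first[key] = $3
      if (!(key in total) || num($4) > num(last[key]))  last[key]  = $4
      total[key] += $5
    }
    END {
      for (key in total) print key, first[key], last[key], total[key]
    }
  ' | LC_ALL=C sort
}

aggregate <<'EOF'
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
EOF
```

As in the bash version, the final sort is there because awk's for (key in array) loop visits keys in an unspecified order.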