Question

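I have a multi-line CSV file. Every line has the same number of columns. I need to group the lines by a few specified columns and aggregate the data from the remaining columns. Example input file: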

proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2

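For the example above, I need to group the lines by the first two columns. From column 3 I need the minimum value, from column 4 the maximum value, and column 5 should hold the sum. So for this input file I need the following output: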

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

I need to process it with bash (I can also use awk or sed).

Answer 1

Using bash (version 4 or later, for associative arrays) and sort:

#!/bin/bash

# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de  # group key, date start and date end
declare -A -i sum   # set integer attribute

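# function to convert 5-Jun-2011 to 20110605
# 10# forces base 10, so a zero-padded day such as 08 is not rejected as invalid octal
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" "$y" "${month2num[$m]}" "$((10#$d))"; }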

# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do

  # if associative array is still empty for this entry
  # fill with current strings/value
  if [[ -z ${p[$p1,$p2]} ]]; then
    p[$p1,$p2]="$p1,$p2"
    ds[$p1,$p2]="$d1"
    de[$p1,$p2]="$d2"
    sum[$p1,$p2]="$s"
    continue
  fi

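  # compare strings, set new strings and sum value
  # (date2num yields fixed-width YYYYMMDD, so string comparison orders the dates correctly)
  if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
    [[ $(date2num "$d1") < $(date2num "${ds[$p1,$p2]}") ]] && ds[$p1,$p2]="$d1"
    [[ $(date2num "$d2") > $(date2num "${de[$p1,$p2]}") ]] && de[$p1,$p2]="$d2"
    # sum has the integer attribute, so the right-hand side is evaluated arithmetically
    sum[$p1,$p2]=sum[$p1,$p2]+s
  fi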

done < file

# print content of all associative arrays, using the keys from associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done

Usage (the script reads its input from a file named "file" in the current directory): ./script.sh | sort

Output to stdout:

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
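
The output is ordered only because of the external sort: bash makes no promise about the order in which "${!p[@]}" returns the keys of an associative array. If piping at the call site is inconvenient, the keys can be sorted inside the script instead. The following is only a sketch of that idea; the array name result and the function print_sorted are hypothetical and not part of the script above, and it assumes the keys contain no newlines.

```shell
#!/usr/bin/env bash
# Sketch: print an associative array in sorted key order from inside the script.
# "result" and "print_sorted" are hypothetical names used only for illustration.

declare -A result=(
  ["proces2,pathB"]="6-Jun-2014,7-Jun-2015,2"
  ["proces1,pathB"]="6-Jun-2017,7-Jun-2017,1"
  ["proces1,pathA"]="5-May-2011,15-Oct-2017,7"
)

# print_sorted: emit "key,value" lines ordered by key
print_sorted() {
  local key
  # "${!result[@]}" yields the keys in no guaranteed order, so sort them first
  while IFS= read -r key; do
    echo "$key,${result[$key]}"
  done < <(printf '%s\n' "${!result[@]}" | LC_ALL=C sort)
}

print_sorted
```

LC_ALL=C is set here so that the ordering does not depend on the caller's locale.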

See: help declare, help read, and of course man bash
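
The part of help declare that matters most for this script is the -i (integer) attribute: with it, the value assigned to a variable or array element is evaluated as an arithmetic expression, which is why the summing works without $(( )). A minimal sketch of the difference, using the hypothetical names total, plain and add_to_group (none of them appear in the script above):

```shell
#!/usr/bin/env bash
# Sketch: effect of the integer attribute on += for associative arrays.
# "total", "plain" and "add_to_group" are hypothetical names for illustration.

declare -A -i total   # values are evaluated as integer expressions
declare -A plain      # ordinary string values

# add_to_group KEY VALUE: accumulate VALUE under KEY
add_to_group() {
  local key=$1 value=$2
  # With -i, += adds numerically; an unset element counts as 0
  total[$key]+=$value
}

add_to_group "proces1,pathA" 5
add_to_group "proces1,pathA" 2
add_to_group "proces2,pathB" 2

# Without -i, += appends to the string instead
plain[k]=5
plain[k]+=2

echo "${total[proces1,pathA]}"   # prints 7
echo "${total[proces2,pathB]}"   # prints 2
echo "${plain[k]}"               # prints 52
```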
