I have a file with task names in the first column and the time taken to complete each task in the second column, like this:
Task2, 3421
Task3, 3300
Task1, 1000
Task2, 1100
Task3, 1200
Task3, 1209
Task4, 1299
Task3, 1289
Task1, 1389
Task2, 1211
Task5, 1216
Task2, 1416
Task1, 2100
Task6, 2416
Task5, 2216
Task7, 1116
Now I have to find the minimum and maximum time for each task and output them in the following format:
task, min time, max time
e.g.
Task1, 1000, 2100 ( from the data given above)
Answer 0 (score: 4)
You can try it with awk:
awk '
BEGIN{FS=", *"; OFS=", "}   # split on a comma plus optional spaces
!($1 in max) || $2>max[$1]{max[$1]=$2}
!($1 in min) || $2<min[$1]{min[$1]=$2}
END{
for(k in max){print k, min[k], max[k]}   # for-in order is unspecified
}' input.txt | sort
You get:
Task1, 1000, 2100
Task2, 1100, 3421
Task3, 1200, 3300
Task4, 1299, 1299
Task5, 1216, 2216
Task6, 2416, 2416
Task7, 1116, 1116
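If the tasks should come out in the order they first appear in the file rather than sorted by name, the same one-pass idea can be extended with an extra array that records the order. This is only a sketch: the function name minmax_in_order and the four sample rows are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): reads "task, time" lines on stdin
# and prints "task, min, max" in order of first appearance.
minmax_in_order() {
  awk '
    BEGIN { FS = ", *"; OFS = ", " }
    NF < 2 { next }                    # skip blank or malformed lines
    !($1 in min) {                     # first record for this task
      order[++n] = $1                  # remember first-appearance order
      min[$1] = max[$1] = $2 + 0       # "+ 0" forces a numeric value
      next
    }
    $2 + 0 < min[$1] { min[$1] = $2 + 0 }
    $2 + 0 > max[$1] { max[$1] = $2 + 0 }
    END { for (i = 1; i <= n; i++) print order[i], min[order[i]], max[order[i]] }
  '
}

# Small sample taken from the data in the question
printf '%s\n' 'Task2, 3421' 'Task1, 1000' 'Task2, 1100' 'Task1, 2100' | minmax_in_order
```

On this sample Task2 is printed before Task1, because it appears first in the input.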
Answer 1 (score: 1)
Another approach is to sort by column 1 and then by column 2, and pick the first and last value for each task:
awk -F, '{arr[$1]=arr[$1] $2} END {for(key in arr) print key, arr[key]}' <(sort -t, -k1,1 -k2,2n file) | awk '{OFS=", "; print $1, $2, $NF}' | sort
Sample run:
$ cat file
Task2, 3421
Task3, 3300
Task1, 1000
Task2, 1100
Task3, 1200
Task3, 1209
Task4, 1299
Task3, 1289
Task1, 1389
Task2, 1211
Task5, 1216
Task2, 1416
Task1, 2100
Task6, 2416
Task5, 2216
Task7, 1116
$ sort -t, -k1,1 -k2,2n file
Task1, 1000
Task1, 1389
Task1, 2100
Task2, 1100
Task2, 1211
Task2, 1416
Task2, 3421
Task3, 1200
Task3, 1209
Task3, 1289
Task3, 3300
Task4, 1299
Task5, 1216
Task5, 2216
Task6, 2416
Task7, 1116
$ awk -F, '{arr[$1]=arr[$1] $2} END {for(key in arr) print key, arr[key]}' <(sort -t, -k1,1 -k2,2n file) | awk '{OFS=", "; print $1, $2, $NF}' | sort
Task1, 1000, 2100
Task2, 1100, 3421
Task3, 1200, 3300
Task4, 1299, 1299
Task5, 1216, 2216
Task6, 2416, 2416
Task7, 1116, 1116
Answer 2 (score: 1)
Using gawk's arrays of arrays:
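gawk 'BEGIN{FS=", *"; OFS=", "}
# first record for a task: it is both the minimum and the maximum so far
!($1 in a) {a[$1]["min"]=$2; a[$1]["max"]=$2; next}
$2>a[$1]["max"] {a[$1]["max"]=$2}
$2<a[$1]["min"] {a[$1]["min"]=$2}
END {for (i in a){
print i, a[i]["min"], a[i]["max"]
}
}' file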
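The nested subscripts above are gawk-specific. Where only a POSIX awk is available, a similar bookkeeping can be approximated with the comma-subscript form a[task, "min"], which awk joins into a single key using SUBSEP. The following is a sketch under that assumption; the function name minmax_subsep and the sample rows are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): reads "task, time" lines on stdin.
minmax_subsep() {
  awk '
    BEGIN { FS = ", *"; OFS = ", " }
    NF < 2 { next }                          # skip blank or malformed lines
    !(($1, "min") in a) {                    # first record for this task
      a[$1, "min"] = a[$1, "max"] = $2 + 0
      tasks[$1] = 1                          # remember the task name
      next
    }
    $2 + 0 < a[$1, "min"] { a[$1, "min"] = $2 + 0 }
    $2 + 0 > a[$1, "max"] { a[$1, "max"] = $2 + 0 }
    END { for (t in tasks) print t, a[t, "min"], a[t, "max"] }
  ' | sort                                   # for-in order is unspecified, so sort
}

# Small sample taken from the data in the question
printf '%s\n' 'Task2, 3421' 'Task1, 1000' 'Task2, 1100' 'Task1, 2100' | minmax_subsep
```

A separate tasks array is kept because the combined keys in a cannot be iterated per task without splitting them on SUBSEP again.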
Answer 3 (score: 1)
Here is another alternative:
$ join -t, <(sort file){,} | sort -k1,1 -k2n -k3nr | rev | uniq -2 | rev
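The one-liner is dense, so here is a sketch that spells out its stages on a tiny sample. It uses a temporary file in place of the process substitution and `uniq -f 2` (the documented spelling of `uniq -2`); the function name minmax_join, the temporary file, and the sample rows are assumptions made for illustration.

```shell
#!/bin/sh
# Walk-through of the join-based pipeline on a small sample.
tmp=$(mktemp)
printf '%s\n' 'Task2, 3421' 'Task1, 1000' 'Task2, 1100' 'Task1, 2100' |
  LC_ALL=C sort > "$tmp"

# Hypothetical helper (the name is illustrative): $1 is a file already sorted by task.
minmax_join() {
  # 1. join the sorted file with itself: every (time, time) pair for each task
  # 2. sort by task, then first time ascending, then second time descending,
  #    so the first row of each task reads "task, min, max"
  # 3. rev | uniq -f 2 | rev: reverse each line so the task name becomes the
  #    last field, ignore the two time fields when comparing, keep the first row
  LC_ALL=C join -t, "$1" "$1" |
    LC_ALL=C sort -k1,1 -k2n -k3nr |
    rev | uniq -f 2 | rev
}

minmax_join "$tmp"
rm -f "$tmp"
```

Note that the self-join emits n×n rows for a task with n entries, so this approach is best suited to small inputs.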
Answer 4 (score: 1)
Using sort, sed and awk:
sort -k1,1 -k2n input.txt | sed -r ':a;N;$!ba;:b;s/(Task[0-9]+, )([0-9 ,]+)\n?\1([0-9]+)/\1\2, \3/g;tb;' | awk 'BEGIN{FS=OFS=", ";}{print $1, $2, $NF}'
An alternative solution using only sort and sed:
sort -k1,1 -k2n input.txt | sed -r ':a;N;$!ba;:b;s/(Task[0-9]+, )([0-9 ,]+)\n?\1([0-9]+)/\1\2, \3/g;tb;' | sed -r -e 's/^([^ ]+)\s([^ ]+)\s.*\s([^ ]+)/\1 \2 \3/' -e 's/^([^ ]+)\s([^ ]+)$/\1 \2, \2/'
You get:
Task1, 1000, 2100
Task2, 1100, 3421
Task3, 1200, 3300
Task4, 1299, 1299
Task5, 1216, 2216
Task6, 2416, 2416
Task7, 1116, 1116
Answer 5 (score: 0)
Sort on the first and second columns, then awk it. The nice thing about this solution (the awk part) is that it does not hold all the data in memory and dump it at the end; instead it prints the result for the previous $1 as soon as a new one is found. Here:
$ sort -t, -k1,1 -k2,2n foo |                      # sort by task, then numerically by time
awk -F', *' -v OFS=', ' '
$1!=p && NR>1 {print p, min, max}                  # task changed: print the previous one
$1!=p {min=$2; p=$1}                               # first line of a task is always its min
{max=$2}                                           # every current line is the max so far
END {if (NR) print p, min, max}'                   # flush the last task
Task1, 1000, 2100
Task2, 1100, 3421
Task3, 1200, 3300
Task4, 1299, 1299
Task5, 1216, 2216
Task6, 2416, 2416
Task7, 1116, 1116
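Answer 6 (score: 0)
This is mostly bash; if that causes you problems, I can replace the awk command with something else (e.g. colrm, if the time always starts in the same column).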
# Keep a colon-delimited list of already processed task names
already_processed=":"
# Use read to split each line on commas; $task (element 0) is the task name
while IFS=',' read -ra task; do
# If the line is empty or the task has already been processed, go to the next line
if [[ -z $task || $already_processed == *":$task:"* ]]; then
continue
else
# Select all the lines for this task from the data file, take the
#+second column and sort it numerically to find the minimum and the maximum.
MIN=$(grep "^$task," "$1" | awk -F', *' '{print $2}' | sort -n | head -1)
MAX=$(grep "^$task," "$1" | awk -F', *' '{print $2}' | sort -n | tail -1)
# Add the task to the "already_processed" tasks (to be sure each task
#+appears only once in the output)
already_processed="$already_processed$task:"
# Print the output in the wanted format.
echo "${task}, ${MIN}, ${MAX}"
fi
done < "$1"
Make sure your data file ends with a newline; otherwise read will not process the last line.
Example:
bash <name_of_script_file> <name_of_data_file> | sort
Task1, 1000, 2100
Task2, 1100, 3421
Task3, 1200, 3300
Task4, 1299, 1299
Task5, 1216, 2216
Task6, 2416, 2416
Task7, 1116, 1116