I have a large ASCII file that looks like this:
12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72
.
.
.
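I want to use awk to compute a selective moving average over three values. The selection is driven by the second and third columns. Only rows whose second column is 3 take part in the average. I want three separate moving averages, one per value of the third column (each "block" contains the same values in the same order). The average itself is taken over the fourth column. For each three-line window I want to output the whole middle (second) line, with its fourth column replaced by the result. I know this sounds complicated, so here is an example of what I want to compute and the desired result: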
(965.814+2764.86+1920.64)/3 = 1883.77
and output the result with line 10 of the data:
18,3,0.12,1883.77
Then continue with the second, eleventh and eighteenth lines, and so on.
The final result for my sample data should look like this:
18,3,0.12,1883.77
12,3,0.22,2697.67
11,3,0.42,4119.13
19,3,0.12,2474.83
20,3,0.22,1833.04
12,3,0.42,2842.96
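I tried to compute the moving average with the following awk code, but I think the script is badly designed, because awk reports a syntax error at every "$2 == 3".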
BEGIN { FS="," ; OFS = "," }
$2 == 3 {
a; b; c; d; e; f = 0
line1 = $0; a = $3; b = $4; getline
line2 = $0; c = $3; d = $4; getline
line3 = $0; e = $3; f = $4
$2 == 3 {
line11 = $0; a = $3; b += $4; getline
line22 = $0; c = $3; d += $4; getline
line33 = $0; e = $3; f += $4
$2 == 3 {
line111 = $0; a = $3; b += $4; getline
line222 = $0; c = $3; d += $4; getline
line333 = $0; e = $3; f += $4
}
}
$0 = line11; $3 = a; $4 = b/3; print
$0 = line22; $3 = c; $4 = d/3; print
$0 = line33; $3 = e; $4 = f/3
}
{print}
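Could you help me understand how to fix my script (I suspect I have misunderstood awk's philosophy), or suggest a completely new script if there is a simpler solution? ;-)
I also tried another idea: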
BEGIN { FS="," ; OFS = "," }
i=0;
do {
i++;
a; b; c; d; e; f = 0
$2 == 3 {
line1 = $0; a = $3; b += $4; getline
line2 = $0; c = $3; d += $4; getline
line3 = $0; e = $3; f += $4
}while(i<3)
$0 = line1; $3 = a; $4 = b/3; print
$0 = line2; $3 = c; $4 = d/3; print
$0 = line3; $3 = e; $4 = f/3
}
{print}
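This one does not work either; awk reports two syntax errors (one at "do" and another after "$2 == 3").
I changed and tried many things in both scripts, and at some point they ran without errors, but they did not produce the desired output at all, so I think there must be a more fundamental problem.
I hope you can help me, that would be really great!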
Answer 0 (score: 2)
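Finding a solution is much easier if you first normalize the input with the right tools.
My idea is to use awk to select the records with $2==3, and then use sort to group the data by the numeric value of the third column: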
% echo '12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72' | \
awk -F, '$2==3' | \
sort --field-separator=, --key=3,3 --numeric-sort --stable
12,3,0.12,965.814
18,3,0.12,2764.86
19,3,0.12,1920.64
17,3,0.12,2738.99
11,3,0.22,4313.2
12,3,0.22,2058.3
20,3,0.22,1721.51
12,3,0.22,1719.3
14,3,0.42,7586.22
11,3,0.42,2929.62
12,3,0.42,1841.55
18,3,0.42,3757.72
%
As you can see, the situation is now much clearer, and we can design an algorithm that outputs the running average of 3 elements.
% awk -F, '$2==3' YOUR_FILE | \
sort --field-separator=, --key=3,3 --numeric-sort --stable | \
awk -F, '
$3!=prev {prev=$3
c=0
s[1]=0;s[2]=0;s[3]=0}
{old=new
new=$0
c = c+1; i = (c-1)%3+1; s[i] = $4
if(c>2)print old FS (s[1]+s[2]+s[3])/3}'
18,3,0.12,2764.86,1883.77
19,3,0.12,1920.64,2474.83
12,3,0.22,2058.3,2697.67
20,3,0.22,1721.51,1833.04
11,3,0.42,2929.62,4119.13
12,3,0.42,1841.55,2842.96
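The index expression i = (c-1)%3+1 in the script above makes the array s behave like a three-slot ring buffer: the counter c maps to slots 1, 2, 3, 1, 2, ... so each new value overwrites the oldest one. Here is a minimal sketch of just that indexing. The helper name ring_demo and the input values 10 to 50 are made up for illustration and are not part of the original data:

```shell
# ring_demo: hypothetical helper, for illustration only.
# Reads one number per line and prints the counter, the slot that was
# just overwritten, and the average of the last three values.
ring_demo() {
    awk '{
        c = c + 1
        i = (c - 1) % 3 + 1        # slot to overwrite: 1, 2, 3, 1, 2, ...
        s[i] = $1
        if (c > 2)                 # the window is full from the third value on
            print c, i, (s[1] + s[2] + s[3]) / 3
    }'
}

# Made-up sample input
printf '%s\n' 10 20 30 40 50 | ring_demo
```

With this input the three printed averages should be 20, 30 and 40, i.e. the averages of (10,20,30), (20,30,40) and (30,40,50).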
I forgot your requirement to replace $4; I will come up with a solution unless you beat me to it...
Edit: change the line
{old=new
to
{split(new,old,",")
and change the line
if(c>2)print old FS (s[1]+s[2]+s[3])/3}'
to
if(c>2) print old[1] FS old[2] FS old[3] FS (s[1]+s[2]+s[3])/3}'
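Putting both changes together, the whole pipeline might look like the sketch below. The function name running_avg3 is only an illustrative wrapper and is not part of the answer above. The short sort options -t, -k3,3n -s are assumed to behave like the long GNU options used earlier.

```shell
# running_avg3: hypothetical wrapper name, for illustration only.
# Reads the comma-separated data on stdin and prints the middle row of
# each three-row window with the fourth column replaced by the average.
running_avg3() {
    awk -F, '$2==3' |                  # keep only rows whose second column is 3
    sort -t, -k3,3n -s |               # group by third column, keep input order within a group
    awk -F, '
    $3 != prev {                       # a new group of equal third-column values starts
        prev = $3
        c = 0
        s[1] = 0; s[2] = 0; s[3] = 0
    }
    {
        split(new, old, ",")           # fields of the previous row (middle of the window)
        new = $0
        c = c + 1; i = (c - 1) % 3 + 1; s[i] = $4
        if (c > 2) print old[1] FS old[2] FS old[3] FS (s[1] + s[2] + s[3]) / 3
    }'
}

# usage (YOUR_FILE is the placeholder from the answer above):
#   running_avg3 < YOUR_FILE
```

Note that the output is grouped by the third column (all 0.12 rows, then 0.22, then 0.42), not interleaved in input order as in the expected result shown in the question.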