Question

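I have a large tab-delimited data table with thousands of rows and dozens of columns, in which missing data are marked as "na". For example,

na	0.93	na	0	na	0.51
1	1	na	1	na	1
1	1	na	0.97	na	1
0.92	1	na	1	0.01	0.34

I would like to calculate the mean of each column while making sure that missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I can use awk, but I don't know how to construct a command that does this for all columns and accounts for the missing data.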

All I know is how to calculate the mean of a single column, but that treats missing data as 0 instead of leaving it out of the calculation.

awk

Answer 1

This is a bit obscure, but it works for your example:

awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt

Edit: here is how it works:

awk '{for(i=1; i<=NF; i++){ #for each column
        sum[i] += $i;       #add the value to the "sum" array ("na" counts as 0)
        if($i != "na"){     #if value is not "na"
           count[i]+=1}     #increment the column "count" (closing brace ends the if)
        }                   #endfor
     }                      #end of the per-line block
    END {                    #at the end
     for(i=1; i<=NF; i++){  #for each column
        if(count[i]!=0){        #if the column count is not 0
            v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
        }else{                  #else (if column count is 0)
            v = 0               #then let mean be 0 (note: you can set this to be "na")
        };                      #endif col count is not 0
        if(i<NF){               #if the column is before the last column
            printf "%f\t",v     #print mean + TAB
        }else{                  #else (if it is the last column)
            print v}            #print mean + NEWLINE (closing brace ends the if/else)
        };                      #endfor
     }' input.txt               #end of END block (note: input.txt is the input file)
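
As a rough sketch of the same logic, the command can be wrapped in a small shell function. The name `col_means`, the `ncols` variable, and the temporary sample file are assumptions made for illustration and are not part of the answer. Unlike the one-liner, this variant only adds a field to the sum when it is not "na", prints "na" for a column with no usable values, and formats every column with `%f`.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): per-column mean, skipping "na".
col_means() {
    awk -F'\t' '
    {
        if (NF > ncols) ncols = NF          # remember the widest row seen
        for (i = 1; i <= NF; i++) {
            if ($i != "na") {               # skip missing values entirely
                sum[i] += $i
                count[i]++
            }
        }
    }
    END {
        for (i = 1; i <= ncols; i++) {
            # a column with no usable values is reported as "na"
            out = (count[i] > 0) ? sprintf("%f", sum[i] / count[i]) : "na"
            printf "%s%s", out, (i < ncols ? "\t" : "\n")
        }
    }' "$1"
}

# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
printf 'na\t0.93\tna\t0\tna\t0.51\n1\t1\tna\t1\tna\t1\n1\t1\tna\t0.97\tna\t1\n0.92\t1\tna\t1\t0.01\t0.34\n' > "$tmp"
col_means "$tmp"
rm -f "$tmp"
```

For the sample data this should print the six means separated by tabs, with "na" in the third position because that column has no numeric values.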


Answer 2

A possible solution:

awk -F"\t" '{for(i=1; i <= NF; i++)
                {if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
            END{for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS} 
                print line}' inputFile

Output for the given data:

0.973333    0.9825  0   0.7425  0.01    0.7125

Note that the third column contains only "na", so its output is 0. If you want the output to be na instead, change the END{...} block to:

END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS} print line}'
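
Putting the pieces of this answer together, a minimal sketch might look like the following. The function name `numeric_col_means` and the `ncols` and `cell` variables are hypothetical; the numeric test `$i == $i + 0` and the `denom` counter come from the answer. The sketch also places the separator only between cells, so the output line has no trailing tab.

```shell
#!/bin/sh
# Hypothetical wrapper: numeric test from the answer, "na" for empty columns.
numeric_col_means() {
    awk -F'\t' '
    {
        if (NF > ncols) ncols = NF          # remember the widest row seen
        for (i = 1; i <= NF; i++)
            if ($i == $i + 0) {             # true only for fields that look numeric
                sum[i] += $i
                denom[i]++
            }
    }
    END {
        line = ""
        for (i = 1; i <= ncols; i++) {
            # columns without numeric values are reported as "na"
            cell = denom[i] ? sum[i] / denom[i] : "na"
            line = line (i > 1 ? FS : "") cell   # separator between cells only
        }
        print line
    }' "$1"
}

# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
printf 'na\t0.93\tna\t0\tna\t0.51\n1\t1\tna\t1\tna\t1\n1\t1\tna\t0.97\tna\t1\n0.92\t1\tna\t1\t0.01\t0.34\n' > "$tmp"
numeric_col_means "$tmp"
rm -f "$tmp"
```

Because the test checks whether a field looks numeric rather than comparing against the literal string "na", other text markers such as "NA" or "n/a" are skipped as well.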

Use awk to calculate the mean of each column, ignoring missing data
