Replace NAs with their respective column means in a very large text file

Date: 2019-01-18 18:58:00

Tags: linux bash shell

I have a large text file: 400k rows and 10k columns, with all data values numeric (0, 1, or 2) and a file size in the 5-10 GB range. The file contains a few missing values, recorded as NA. I want to replace each NA with its column mean, i.e. an NA in column 'x' must be replaced by the mean of column 'x'. These are the steps I want to do (a sketch of the first step follows the list):

  1. Compute the mean of each column of the text file (excluding the header row and starting from the 7th column)
  2. Replace each NA with its respective column mean
  3. Write the modified file back out as a txt file
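
For step 1, something like this one-pass awk sketch should compute all the column means in a single scan, assuming whitespace-separated fields, a single header line, and missing values written literally as NA (the file name is a placeholder):

awk 'NR > 1 {                        # skip the header line
         for (i = 7; i <= NF; i++)   # data columns start at field 7
             if ($i != "NA") { sum[i] += $i; cnt[i]++ }
     }
     END {
         for (i = 7; i <= NF; i++)   # NF still holds the last row's field count
             print "column", i, "mean:", (cnt[i] ? sum[i] / cnt[i] : "NA")
     }' data.txt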

Data subset:

IID  FID  PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234  0  0    1   -9       0  NA  0  1 
2346 2346  0  0    2   -9       1  2  NA  1
1334 1334  0  0    2   -9       2  NA  0  2
4566 4566  0  0    2   -9       2  2  NA  0
4567 4567  0  0    1   -9       NA NA  1  1

# total 400k rows and 10k columns

Desired Output:

# Assuming only 5 rows as given in the above example.
# Mean of column X1 = (0 + 1 + 2 + 2)/4 = 1.25
# Mean of column X2 = (2 + 2)/2 = 2
# Mean of column X3 = (0 + 0 + 1)/3 = 0.33
# Mean of column X4 = No NAs, so no replacements

# Replacing NAs with respective means:

IID  FID  PAT MAT SEX PHENOTYPE X1   X2  X3   X4......
1234 1234  0  0    1   -9       0    2   0     1 
2346 2346  0  0    2   -9       1    2   0.33  1
1334 1334  0  0    2   -9       2    2   0     2
4566 4566  0  0    2   -9       2    2   0.33  0
4567 4567  0  0    1   -9       1.25 2   1     1

I tried this:

file="path/to/data.txt"

#get total number of columns
number_cols=$(awk -F' ' '{print NF; exit}' $file)

for ((i=7; i<=$number_cols; i=i+1))
do 
    echo $i
    # getting the mean of each column
    mean+=$(awk '{ total += $i } END { print total/NR }' $file)
done

# array of column means
echo ${mean[@]}

# find and replace (newstr must be replaced by respective column means)
find $file -type f -exec sed -i 's/NA/newstr/g' {} \;

However, this code is incomplete, and the for loop is very slow because it scans the entire file once per column (10k passes over a 5-10 GB file). Is there a faster way to do this? I tried it in Python and R, but both were too slow. I am open to getting this done in any programming language as long as it is fast. Can someone please help me write the script?
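
Something along these lines is the direction I am thinking of: a minimal two-pass awk sketch where the first pass accumulates per-column sums and counts of non-NA values (columns 7 and up) and the second pass rewrites each NA cell with the stored mean. It assumes whitespace-separated fields and one header line, rows containing an NA come out with single-space separators, and the file names are placeholders:

awk '
    NR == FNR {                        # pass 1: accumulate sums and counts
        if (FNR > 1)
            for (i = 7; i <= NF; i++)
                if ($i != "NA") { sum[i] += $i; cnt[i]++ }
        next
    }
    {                                  # pass 2: replace NA cells and print
        if (FNR > 1)
            for (i = 7; i <= NF; i++)
                if ($i == "NA" && cnt[i] > 0) $i = sum[i] / cnt[i]
        print
    }
' data.txt data.txt > data_imputed.txt

Listing the file twice on the command line is what drives the two passes (NR == FNR is true only while the first copy is being read). Memory use stays small since only the 10k running sums and counts are held, but the file is read twice, so it still means roughly two sequential scans of 5-10 GB.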

Thanks

0 Answers:

No answers