I have a large text file with 400k rows and 10k columns, where every data value is 0, 1, or 2. The file size ranges from 5 to 10 GB. A few values are missing and are coded as NA. I want to replace each NA with its column mean, i.e. an NA in column 'x' must be replaced by the mean of the non-missing values in column 'x'. Here is an example of what I want to do:
Data subset:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1
# total 400k rows and 10k columns
Desired Output:
# Assuming only 5 rows as given in the above example.
# Mean of column X1 = (0 + 1 + 2 + 2)/4 = 1.25
# Mean of column X2 = (2 + 2)/2 = 2
# Mean of column X3 = (0 + 0 + 1)/3 = 0.33
# Mean of column X4 = No NAs, so no replacements
# Replacing NAs with respective means:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 2 0 1
2346 2346 0 0 2 -9 1 2 0.33 1
1334 1334 0 0 2 -9 2 2 0 2
4566 4566 0 0 2 -9 2 2 0.33 0
4567 4567 0 0 1 -9 1.25 2 1 1
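The logic in the example above can be sketched as a two-pass awk script: the first pass collects per-column sums and counts of the non-NA values, and the second pass rewrites each NA with its column mean. This is only a minimal sketch under some assumptions: the function name `impute_means` and its `first_col` argument are hypothetical, the data columns are assumed to start at column 7, the file is assumed to be whitespace-separated with one header row, and means are rounded to two decimals because that is what the example shows. I have not measured how it behaves on a 5-10 GB file.

```shell
#!/usr/bin/env bash
# Two-pass mean imputation with awk (a sketch, not a tuned implementation).
# The file is passed twice so awk reads it twice; NR == FNR is true only
# while the first copy is being read.
# impute_means and first_col are hypothetical names used for illustration.
impute_means() {
  local file=$1 first_col=${2:-7}
  awk -v first="$first_col" '
    # Pass 1: per-column sum and count, skipping the header row and NAs.
    NR == FNR {
      if (FNR > 1)
        for (i = first; i <= NF; i++)
          if ($i != "NA") { sum[i] += $i; cnt[i]++ }
      next
    }
    # Pass 2: replace each NA with its column mean, rounded to 2 decimals.
    # Adding 0 turns "2.00" back into 2, so whole-number means print as integers.
    FNR > 1 {
      for (i = first; i <= NF; i++)
        if ($i == "NA" && cnt[i] > 0)
          $i = sprintf("%.2f", sum[i] / cnt[i]) + 0
    }
    # Print every line of the second pass; the header is printed unchanged.
    { print }
  ' "$file" "$file"
}
```

It would be called as `impute_means path/to/data.txt > imputed.txt`. The file is read sequentially twice and only two small arrays (one entry per column) are kept in memory, so the cost should not depend on the number of columns the way a per-column loop does. A column that contains only NAs is left as NA.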
I tried this:
file="path/to/data.txt"
# get the total number of columns from the first line
number_cols=$(awk '{ print NF; exit }' "$file")
mean=()
for ((i=7; i<=number_cols; i++))
do
  echo "$i"
  # mean of column i: pass i into awk with -v, skip the header row and NA entries
  mean+=("$(awk -v c="$i" 'NR > 1 && $c != "NA" { total += $c; n++ } END { if (n) print total / n }' "$file")")
done
# array of column means
echo "${mean[@]}"
# find and replace (placeholder: newstr must be replaced by the respective column means)
sed -i 's/NA/newstr/g' "$file"
However, this code is incomplete. The for loop is very slow because it rereads the whole file once per column, and my data is huge. Is there a faster way to do this? I also tried Python and R, but both were too slow. I am open to any programming language as long as it is fast. Can someone please help me write the script?
Thanks