Question

I have a large data file in the form:

Input_SNP   Set_1    Set_2     Set_3     Set_4     Set_5     Set_6
1.09        0.162    NA        2.312     1.876     0.12      0.812
0.687       NA       0.987     1.32      1.11      1.04      NA
NA          1.890    0.923     1.43      0.900     2.02      2.7
2.801       0.642    0.791     0.812     NA        0.31      1.60
1.33        1.33     NA        1.22      0.23      0.18      1.77
2.91        1.00     1.651     NA        1.55      3.20      0.99
2.00        2.31     0.89      1.13      1.25      0.12      1.55

I would like to plot the distribution of the number of values in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a different, effectively random, amount of missing data (NA), which skews a distribution of raw counts. It seems my only option is to use the proportion of non-missing values instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6.

I would like to plot these percentages as a bell curve or a binned histogram. Despite my example, in the real data the percentage in each column over 2 should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to plot my Input_SNP percentage on that distribution to get a p-value. Do you guys know how to do this in R? The data is currently available both as a data.frame and as a .csv file.

I tried hist(colSums(as.matrix(df) > 2)), but that did not work (I think because of the NAs). How can I account for them?

My desired output is a histogram of percentages of each column that is over 2. The bins in the histogram can be 0.05.

Answer 1

Perhaps you could try this, assuming your data is in a data.frame called df:

result <- unlist(lapply(sapply(df, function(x) which(x>2)), function(x) length(x)))
result
#Input_SNP     Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#    2         1         0         1         0         2         1

In reality this is a three-step process. First, result <- sapply(df, function(x) which(x>2)) gives you the following structure:

#List of 7
#$ Input_SNP: int [1:2] 4 6
#$ Set_1    : int 7
#$ Set_2    : int(0) 
#$ Set_3    : int 1
#$ Set_4    : int(0) 
#$ Set_5    : int [1:2] 3 6
#$ Set_6    : int 3

This is then passed to lapply() in the following form:

lapply(result, function(x) length(x))

which gives the following structure:

#List of 7
#$ Input_SNP: int 2
#$ Set_1    : int 1
#$ Set_2    : int 0
#$ Set_3    : int 1
#$ Set_4    : int 0
#$ Set_5    : int 2
#$ Set_6    : int 1

Finally, unlist() flattens this list into the named vector shown at the top.
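The three steps above can also be collapsed into a single pass. The following is only a sketch of that idea, not the answer's own code: it rebuilds the sample data from the question so that it runs on its own, and the name counts is just illustrative.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09  0.162 NA    2.312 1.876 0.12 0.812
0.687 NA    0.987 1.32  1.11  1.04 NA
NA    1.890 0.923 1.43  0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA    0.31 1.60
1.33  1.33  NA    1.22  0.23  0.18 1.77
2.91  1.00  1.651 NA    1.55  3.20 0.99
2.00  2.31  0.89  1.13  1.25  0.12 1.55
")

# Count the values above 2 in each column; na.rm = TRUE skips the NAs.
# vapply() always returns a named integer vector, so no unlist() is needed.
counts <- vapply(df, function(x) sum(x > 2, na.rm = TRUE), integer(1))
counts
#Input_SNP     Set_1     Set_2     Set_3     Set_4     Set_5     Set_6
#        2         1         0         1         0         2         1
```

Because the result type is fixed by integer(1), this form does not depend on how sapply() happens to simplify its output.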

If Input_SNP should not be part of the desired result, remove it from the df inside the sapply(), like so:

result <- unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
result
#Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 
#1     0     1     0     2     1

Finally, for the proportions, divide the counts by the number of non-missing values in each column:

result/colSums(!is.na(df[,-1]))
#    Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#0.1666667 0.0000000 0.1666667 0.0000000 0.2857143 0.1666667
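As a hedged alternative, the same proportions can be computed with colSums() alone. The sketch below rebuilds the question's sample data so it is self-contained; sets, over, n_obs and props are illustrative names. It also shows why the attempt in the question struggled: without na.rm = TRUE, colSums() returns NA for every column that contains a missing value.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09  0.162 NA    2.312 1.876 0.12 0.812
0.687 NA    0.987 1.32  1.11  1.04 NA
NA    1.890 0.923 1.43  0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA    0.31 1.60
1.33  1.33  NA    1.22  0.23  0.18 1.77
2.91  1.00  1.651 NA    1.55  3.20 0.99
2.00  2.31  0.89  1.13  1.25  0.12 1.55
")

sets <- df[, -1]                      # drop the Input_SNP column

# Without na.rm, any column containing an NA sums to NA
colSums(sets > 2)

over  <- colSums(sets > 2, na.rm = TRUE)  # values above 2 per column
n_obs <- colSums(!is.na(sets))            # non-missing values per column
props <- over / n_obs                     # per-column proportion above 2
props
```

Dividing by n_obs rather than nrow(sets) gives each column its own denominator, which is what the 1/6 and 0/5 examples in the question describe.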

Answer 2

If you just want a histogram of the proportion of non-missing values >2, you can just do

hist(colMeans(as.matrix(df[,-1]) > 2, na.rm=TRUE))

The df[,-1] removes the Input_SNP column, and colMeans() on the logical values gives the proportions. With na.rm=TRUE, each column's denominator is its own count of non-missing values.
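To get the fixed 0.05 bins and the Input_SNP comparison asked for in the question, something along these lines might work. This is a sketch under assumptions, not part of the answer above: it treats 0.05 as the bin width on whatever scale the proportions are on, and it takes the "p-value" to be the empirical fraction of sets whose proportion is at least as large as that of Input_SNP. The names props, input_prop, width, breaks and p_emp are illustrative, and the sample data is rebuilt so the snippet runs on its own.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09  0.162 NA    2.312 1.876 0.12 0.812
0.687 NA    0.987 1.32  1.11  1.04 NA
NA    1.890 0.923 1.43  0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA    0.31 1.60
1.33  1.33  NA    1.22  0.23  0.18 1.77
2.91  1.00  1.651 NA    1.55  3.20 0.99
2.00  2.31  0.89  1.13  1.25  0.12 1.55
")

props      <- colMeans(as.matrix(df[, -1]) > 2, na.rm = TRUE)  # one value per set
input_prop <- mean(df$Input_SNP > 2, na.rm = TRUE)             # observed value

# Breaks of width 0.05 that cover both the sets and the observed value
width  <- 0.05
n_bins <- max(1, ceiling(max(props, input_prop) / width))
breaks <- (0:n_bins) * width

hist(props, breaks = breaks,
     main = "Proportion of values > 2 per set", xlab = "Proportion > 2")
abline(v = input_prop, col = "red", lwd = 2)  # mark Input_SNP on the histogram

# Assumed definition: share of sets at least as extreme as Input_SNP
p_emp <- mean(props >= input_prop)
p_emp
```

With only six sets, as in the sample data, the empirical value is very coarse; it becomes more meaningful with the large number of columns described in the question.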

R distribution plot with NA data and thresholds
