I have a large data file in the form:
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
I would like to make a distribution of the number of values in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a "random" amount of missing data (NA), which skews the raw counts. It seems my only option is to use a distribution of proportions instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6. I would like to plot these percentages as a bell curve or binned histogram. Despite my example, in the real data the percentage of values over 2 in each column should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to place my Input_SNP percentage on that distribution to get a p-value. Do you know how to do this in R? The data is currently available both as a data.frame and as a .csv file.
I have been trying:
hist(colSums(as.matrix(df) > 2))
but that has not been working (I think because of the NAs). How can I account for them?
My desired output is a histogram of percentages of each column that is over 2. The bins in the histogram can be 0.05.
Answer 0 (score: 1)
Perhaps you could try this, assuming your data is in a data.frame called df:
result <- unlist(lapply(sapply(df, function(x) which(x>2)), function(x) length(x)))
result
#Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
# 2 1 0 1 0 2 1
In reality this is a three-step process. First, result <- sapply(df, function(x) which(x>2)) will give you the following structure (the row indices of the values over 2 in each column; which() drops the NAs):
#List of 7
#$ Input_SNP: int [1:2] 4 6
#$ Set_1 : int 7
#$ Set_2 : int(0)
#$ Set_3 : int 1
#$ Set_4 : int(0)
#$ Set_5 : int [1:2] 3 6
#$ Set_6 : int 3
This list is then passed to lapply() in the following form:
lapply(result, function(x) length(x))
which gives the following structure:
#List of 7
#$ Input_SNP: int 2
#$ Set_1 : int 1
#$ Set_2 : int 0
#$ Set_3 : int 1
#$ Set_4 : int 0
#$ Set_5 : int 2
#$ Set_6 : int 1
Finally, unlist() flattens this list into the named vector of counts shown at the top.
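As a side note, the same counts can probably be obtained in a single step with colSums() and na.rm = TRUE. This also suggests why the hist(colSums(as.matrix(df) > 2)) attempt from the question did not work: without na.rm, any column that contains an NA sums to NA. Below is a minimal sketch that rebuilds the example data from the question; the name counts is only used here for illustration.

```r
# Rebuild the example data from the question (NA marks missing values)
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1 = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2 = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3 = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4 = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5 = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6 = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# df > 2 is a logical matrix in which missing entries stay NA,
# so they have to be dropped explicitly when summing each column
counts <- colSums(df > 2, na.rm = TRUE)
counts
```

The which()/length() chain above and this version should agree on the example data; the colSums() form is simply shorter.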
If Input_SNP should not be part of the desired result, remove it from df inside the sapply() and store the counts again, like so:
result <- unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
result
#Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
#1 0 1 0 2 1
Finally, for the proportions, divide these counts by the number of non-missing values in each column:
result/colSums(!is.na(df[,-1]))
# Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
#0.1666667 0.0000000 0.1666667 0.0000000 0.2857143 0.1666667
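To get from these proportions to the histogram and p-value asked for in the question, something along the following lines might work. It is a sketch under stated assumptions, not a definitive method: colMeans() with na.rm = TRUE is used as a shortcut for the proportions, bin_width is a placeholder that should match the scale of the real data, and the p-value is a one-sided empirical p-value with an add-one correction, which is only one common convention. The names props, set_props, input_prop and p_value are introduced here for illustration.

```r
# Rebuild the example data from the question (NA marks missing values)
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1 = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2 = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3 = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4 = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5 = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6 = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# Proportion of non-missing values above 2 in every column
props <- colMeans(df > 2, na.rm = TRUE)
input_prop <- props[["Input_SNP"]]
set_props <- props[names(props) != "Input_SNP"]

# Assumed bin width; adjust it to the scale of the real proportions
bin_width <- 0.05
upper <- ceiling(max(props) / bin_width) * bin_width
breaks <- seq(0, upper, by = bin_width)

# Histogram of the Set_ proportions with Input_SNP marked as a vertical line
hist(set_props, breaks = breaks,
     main = "Proportion of values > 2 per set", xlab = "Proportion")
abline(v = input_prop, col = "red", lwd = 2)

# Assumed definition: share of sets at least as extreme as Input_SNP,
# with one added to numerator and denominator so the result is never 0
p_value <- (sum(set_props >= input_prop) + 1) / (length(set_props) + 1)
p_value
```

With only six sets the p-value is very coarse; the approach becomes more meaningful with the large number of columns described in the question.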
Answer 1 (score: 1)