I have a large data frame called chr22_hap12 that looks like this (only the first three columns are shown):
2 1 3
2 1 3
2 1 3
2 1 2
2 2 1
2 2 1
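I want to get the proportion of each number (one, two, three, in that order) in each column and store the result in a data frame.
This is what I have done so far: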
for (i in 1:3 ) {
length(chr22_hap12[,i]) -> total_snps
sum(chr22_hap12[,i]==1,na.rm=FALSE) -> counts_ancestry_1
sum(chr22_hap12[,i]==2,na.rm=FALSE) -> counts_ancestry_2
sum(chr22_hap12[,i]==3,na.rm=FALSE) -> counts_ancestry_3
(counts_ancestry_1*100)/total_snps -> ancestry_1_perc
(counts_ancestry_2*100)/total_snps -> ancestry_2_perc
(counts_ancestry_3*100)/total_snps -> ancestry_3_perc
haplo_df[i] = NULL
haplo_df[i] = c(ancestry_1_perc,ancestry_2_perc,ancestry_3_perc)
as.data.frame(haplo_df[i])
}
I get these errors. When trying to set haplo_df[i] = NULL:
haplo_df[i] = NULL : object 'haplo_df' not found
Then, after haplo_df[i] = c(ancestry_1_perc,ancestry_2_perc,ancestry_3_perc):
haplo_df[i] = c(ancestry_1_perc, ancestry_2_perc, ancestry_3_perc) : object 'haplo_df' not found
And again with as.data.frame(haplo_df[i]):
object 'haplo_df' not found
My desired output should look like this:
0.00 66.66 50.0
100.00 33.33 33.33
0.00 0.00 16.66
Answer 0 (score: 1)
You need to define the result matrix before the loop, then cbind each new result onto that matrix.
# define the result object before the loop (cbind turns it into a matrix).
haplo_df <- NULL
for (i in 1:3 ) {
length(chr22_hap12[,i]) -> total_snps
sum(chr22_hap12[,i]==1,na.rm=FALSE) -> counts_ancestry_1
sum(chr22_hap12[,i]==2,na.rm=FALSE) -> counts_ancestry_2
sum(chr22_hap12[,i]==3,na.rm=FALSE) -> counts_ancestry_3
(counts_ancestry_1*100)/total_snps -> ancestry_1_perc
(counts_ancestry_2*100)/total_snps -> ancestry_2_perc
(counts_ancestry_3*100)/total_snps -> ancestry_3_perc
# bind the new result to the existing data
haplo_df <- cbind(haplo_df , c(ancestry_1_perc,ancestry_2_perc,ancestry_3_perc))
}
# return the result
haplo_df
##      [,1]     [,2]     [,3]
## [1,]    0 66.66667 33.33333
## [2,]  100 33.33333 16.66667
## [3,]    0  0.00000 50.00000
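Growing an object with cbind inside a loop copies it on every iteration, which can get slow when there are many columns. As a rough sketch of the same idea with a preallocated matrix (the name haplo_mat and the inner loop over k are my own choices here, and the data frame is rebuilt from the six rows shown in the question):

```r
# The six example rows from the question, rebuilt so the snippet runs on its own
chr22_hap12 <- data.frame(V1 = c(2, 2, 2, 2, 2, 2),
                          V2 = c(1, 1, 1, 1, 2, 2),
                          V3 = c(3, 3, 3, 2, 1, 1))

# Preallocate: one row per ancestry code (1, 2, 3), one column per input column
haplo_mat <- matrix(NA_real_, nrow = 3, ncol = ncol(chr22_hap12),
                    dimnames = list(as.character(1:3), names(chr22_hap12)))

for (i in seq_len(ncol(chr22_hap12))) {
  column <- chr22_hap12[, i]
  for (k in 1:3) {
    # percentage of entries in this column equal to k
    haplo_mat[k, i] <- 100 * sum(column == k) / length(column)
  }
}

haplo_mat
```

seq_len(ncol(...)) is used instead of 1:3 so the loop covers every column of the real data, not just the first three.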
Alternatively, you can use apply and table, for example:
apply(chr22_hap12, 2, function(x) 100*table(factor(x, levels=1:3))/length(x))
##    V1       V2       V3
## 1   0 66.66667 33.33333
## 2 100 33.33333 16.66667
## 3   0  0.00000 50.00000
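Since the question asks for a data frame, the matrix returned by apply can be converted afterwards. A small sketch, assuming two decimals are enough (the names pct and haplo_df, and the rounding, are my assumptions; the data is the six rows from the question):

```r
# The six example rows from the question
chr22_hap12 <- data.frame(V1 = c(2, 2, 2, 2, 2, 2),
                          V2 = c(1, 1, 1, 1, 2, 2),
                          V3 = c(3, 3, 3, 2, 1, 1))

# Percentage of 1s, 2s and 3s per column; fixed levels keep the zero counts
pct <- apply(chr22_hap12, 2,
             function(x) 100 * table(factor(x, levels = 1:3)) / length(x))

# Round for display and store the result as a data frame
haplo_df <- as.data.frame(round(pct, 2))
haplo_df
```

The row names 1, 2 and 3 of haplo_df are the ancestry codes, and the column names are carried over from chr22_hap12.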
Answer 1 (score: 1)
My one-liner:
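# df is your data frame; fixing the levels keeps values that never occur in a column (as 0)
sapply(df, function(x) prop.table(table(factor(x, levels = 1:3))) * 100)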
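One caveat with a table-based one-liner: table() only reports the values that actually occur, so a column missing one of the values gives a shorter result. A quick sketch of the difference on the six rows from the question (the names loose and fixed are just for illustration):

```r
# The six example rows from the question; V1 contains only the value 2
chr22_hap12 <- data.frame(V1 = c(2, 2, 2, 2, 2, 2),
                          V2 = c(1, 1, 1, 1, 2, 2),
                          V3 = c(3, 3, 3, 2, 1, 1))

# Without fixed levels the per-column results have lengths 1, 2 and 3,
# so sapply() cannot simplify them and returns a list
loose <- sapply(chr22_hap12, function(x) prop.table(table(x)) * 100)
is.list(loose)

# With levels fixed to 1:3 every column yields three entries,
# so sapply() simplifies the result to a 3 x 3 matrix
fixed <- sapply(chr22_hap12,
                function(x) prop.table(table(factor(x, levels = 1:3))) * 100)
fixed
```

On real data where every column contains all three values, both versions should give the same matrix, but fixing the levels is the safer default.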
Answer 2 (score: 0)
Here is another approach.
Sample data:
set.seed(23)
y <- 1:3
df <- data.frame(a = sample(y, 10, replace = TRUE),
b = sample(y, 10, replace = TRUE),
c = sample(y, 10, replace = TRUE))
#df
#   a b c
#1  2 3 2
#2  1 3 1
#3  1 2 1
#4  3 1 3
#5  3 3 2
#6  2 1 3
#7  3 2 3
#8  3 2 3
#9  3 3 1
#10 3 2 3
Compute the percentages:
newdf <- as.data.frame(t(do.call(rbind, lapply(df, function(z){
sapply(y, function(x) (sum(z == x) / length(z))*100)
}))))
#newdf
#   a  b  c
#1 20 20 30
#2 20 40 20
#3 60 40 50
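A more compact variant of the same idea, offered as a sketch rather than a replacement: compare the whole data frame to each value and take column means. The data frame is typed in from the printout above instead of being regenerated, since the result of sample() after set.seed() may differ between R versions; the name pct is my own.

```r
# Sample data entered by hand from the printed df above
df <- data.frame(a = c(2, 1, 1, 3, 3, 2, 3, 3, 3, 3),
                 b = c(3, 3, 2, 1, 3, 1, 2, 2, 3, 2),
                 c = c(2, 1, 1, 3, 2, 3, 3, 3, 1, 3))

# df == k is a logical matrix; its column means are the share of entries equal to k.
# sapply() returns one column per k, so transpose to get one row per value.
pct <- t(sapply(1:3, function(k) colMeans(df == k) * 100))
newdf <- as.data.frame(pct)
newdf
```

This avoids the nested lapply/sapply, at the cost of building one logical matrix per value.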
Answer 3 (score: 0)
Try:
mydf
  V1 V2 V3
1  2  1  3
2  2  1  3
3  2  1  3
4  2  1  2
5  2  2  1
6  2  2  1
ll = list()
for(cc in 1:3) {
dd = mydf[,cc]
n1 = 100*length(dd[dd==1])/nrow(mydf)
n2 = 100*length(dd[dd==2])/nrow(mydf)
n3 = 100*length(dd[dd==3])/nrow(mydf)
ll[[length(ll)+1]] = c(n1, n2, n3)
}
ll
[[1]]
[1]   0 100   0

[[2]]
[1] 66.66667 33.33333  0.00000

[[3]]
[1] 33.33333 16.66667 50.00000

t(do.call(rbind, ll))
     [,1]     [,2]     [,3]
[1,]    0 66.66667 33.33333
[2,]  100 33.33333 16.66667
[3,]    0  0.00000 50.00000