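I want to count the occurrences of the three factor levels for each column of mydata, so I thought of the `table` function.
Some of the data in mydata: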
A0AUT A0AYT A0AZT A0B2T A0B3T
100130426 no_change no_change no_change no_change no_change
100133144 no_change no_change down no_change no_change
100134869 no_change no_change no_change no_change no_change
10357 no_change up no_change no_change up
10431 no_change up no_change no_change no_change
136542 no_change up no_change no_change no_change
> str(mydata)
'data.frame': 20531 obs. of 518 variables:
$ A0AUT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 2 2 2 2 ...
$ A0AYT: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 3 3 2 2 2 3 ...
$ A0AZT: Factor w/ 3 levels "down","no_change",..: 2 1 2 2 2 2 1 2 2 2 ...
$ A0B2T: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 1 2 2 2 ...
$ A0B3T: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 2 2 2 2 2 2 ...
$ A0B5T: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 2 2 2 2 2 2 ...
$ A0B7T: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 1 2 2 2 ...
$ A0B8T: Factor w/ 3 levels "down","no_change",..: 2 1 1 2 3 2 2 2 2 2 ...
$ A0BAT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 2 2 2 2 ...
$ A0BCT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 3 2 2 2 2 2 ...
Now I did:
occurences <- apply(mydata, 1, table)
> occurences[[1]] # 100130426
no_change up
508 10
> occurences[[2]] # 100133144
down no_change up
45 446 27
But I want them as a matrix (or at least I think that would be easier to work with), so I did this:
freq <- sapply(occurences, function(x){
c(x, rep(0, 3 - length(x)))
})
> freq[,1:5]
100130426 100133144 100134869 10357 10431
no_change 508 45 14 3 3
up 10 446 411 330 268
0 27 93 185 247
But as you can see, the counts for 100133144 are misplaced: its no_change count ended up in the up row!
My expected output is:
> freq[,1:5]
100130426 100133144 100134869 10357 10431
up 10 45 14 3 3
no_change 508 446 411 330 268
down 0 27 93 185 247
How can I get each value into the right place? As you can see, each table may have only one to three elements, so doing this:
freq <- matrix(unlist(occurences), nrow=3)
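results in an error, because the length is not a multiple of 3.
I may have taken a poor approach to computing the per-column frequencies of mydata. I would prefer a base R approach that does not use any libraries.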
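The misalignment described above can be reproduced in a minimal sketch. It assumes two of the sample rows written out by hand as character vectors; `row1`, `row2`, and `pad` are illustrative names, not objects from the question. `table()` only reports the values actually present, in sorted order, so padding by position puts different levels in the same slot.

```r
# Two of the sample rows as plain character vectors (illustrative names)
row1 <- c("no_change", "no_change", "no_change", "no_change", "no_change")
row2 <- c("no_change", "no_change", "down", "no_change", "no_change")

t1 <- table(row1)  # only "no_change" is present
t2 <- table(row2)  # "down" and "no_change" are present, sorted alphabetically

# Pad by position to length 3, mirroring the sapply() call in the question
pad <- function(x) c(x, rep(0, 3 - length(x)))
p1 <- pad(t1)
p2 <- pad(t2)

# Slot 1 holds "no_change" for row1 but "down" for row2,
# so stacking p1 and p2 column-wise mixes up the levels.
names(p1)[1]
names(p2)[1]
```

Because the first column's names become the matrix row names, every later column is read against labels that may not match its own order.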
Answer 0 (score: 2)
We can use `table`. Convert the 'data.frame' to a 'matrix', reshape it from 'wide' to 'long' (using `melt` from `reshape2`), and call `table` on the relevant columns to get the frequency counts.
library(reshape2)
table(melt(as.matrix(mydata))[c(3,1)])
# Var1
#value 10357 10431 136542 100130426 100133144 100134869
# down 0 0 0 0 1 0
# no_change 3 4 4 5 4 5
# up 2 1 1 0 0 0
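Or using only base R, we can `unlist` the data to get a `vector`, replicate the 'row names' so that they line up with it (using `row`, which gives the row index of each element), and then call `table`.
table(unlist(mydata), row.names(mydata)[row(mydata)])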
# Var1
#value 10357 10431 136542 100130426 100133144 100134869
# down 0 0 0 0 1 0
# no_change 3 4 4 5 4 5
# up 2 1 1 0 0 0
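As a self-contained check of the base R idea, here is a sketch that rebuilds the six sample rows by hand (an assumption: the real `mydata` has 20531 rows and 518 columns) and counts the values per row. The names `vals`, `ids`, and `counts` are illustrative. The columns are converted to character before flattening so that the result does not depend on how factor levels are combined.

```r
# Sample rows from the question, rebuilt by hand for illustration
mydata <- data.frame(
  A0AUT = rep("no_change", 6),
  A0AYT = c("no_change", "no_change", "no_change", "up", "up", "up"),
  A0AZT = c("no_change", "down", "no_change", "no_change", "no_change", "no_change"),
  A0B2T = rep("no_change", 6),
  A0B3T = c("no_change", "no_change", "no_change", "up", "no_change", "no_change"),
  row.names = c("100130426", "100133144", "100134869", "10357", "10431", "136542"),
  stringsAsFactors = TRUE
)

# Flatten column by column into one character vector
vals <- unlist(lapply(mydata, as.character), use.names = FALSE)

# row(mydata) gives each cell's row index in the same column-major order,
# so indexing the row names with it yields one id per flattened value
ids <- row.names(mydata)[row(mydata)]

# Cross-tabulate value against row id
counts <- table(value = vals, id = ids)
counts
```

Since the ids are character strings here, the columns of `counts` are sorted as text rather than numerically.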
Another option is `dplyr`/`tidyr`:
library(dplyr)
library(tidyr)
add_rownames(mydata) %>%
gather(Var, Val,-rowname) %>%
group_by(rowname, Val) %>%
summarise(n=n()) %>%
spread(rowname, n, fill=0)
If the dataset columns are of `factor` class, we can convert them to `character` class before doing the `unlist`.
mydata[] <- lapply(mydata, as.character)
If this should be done for each row:
library(qdapTools)
t(mtabulate(as.data.frame(t(mydata))))
# 100130426 100133144 100134869 10357 10431 136542
#no_change 5 4 5 3 4 4
#down 0 1 0 0 0 0
#up 0 0 0 2 1 1
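Or using only base R, we create a vector of the unique elements in the dataset ('nm1'; it is known here, but if not, `nm1 <- unique(unlist(lapply(mydata, as.character)))`), then loop over the rows using `apply` with `MARGIN=1` and use `tabulate` after converting the row vector to a `factor` with `levels` specified as 'nm1'. In `tabulate`, we can also specify the length of the returned vector, i.e. the length of 'nm1'. The output will be a `matrix`. We can assign the row names (`row.names<-`) as 'nm1'.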
nm1 <- c('up', 'no_change', 'down')
`row.names<-`(apply(mydata, 1, function(x)
tabulate(factor(x, levels=nm1),length(nm1))), nm1)
# 100130426 100133144 100134869 10357 10431 136542
#up 0 0 0 2 1 1
#no_change 5 4 5 3 4 4
#down 0 1 0 0 0 0
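The same fixed-levels idea can also be applied to the question's original `table` approach. The sketch below is a variant, not the code from this answer: it assumes three of the sample rows written out as a named list (`rows` is an illustrative name) and calls `table` on a factor whose levels are fixed to 'nm1', so every per-row table has the same three entries in the same order.

```r
# Fixed level order, as in the answer above
nm1 <- c("up", "no_change", "down")

# Three of the sample rows as a named list (illustrative data)
rows <- list(
  `100130426` = c("no_change", "no_change", "no_change", "no_change", "no_change"),
  `100133144` = c("no_change", "no_change", "down", "no_change", "no_change"),
  `10357`     = c("no_change", "up", "no_change", "no_change", "up")
)

# Each table now has length 3 and includes zero counts for absent levels,
# so sapply() can simplify the results to a 3-row matrix
freq <- sapply(rows, function(x) table(factor(x, levels = nm1)))
freq
```

With the real data frame, the same anonymous function should work when passed to `apply(mydata, 1, ...)`, since `apply` hands each row over as a character vector.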
Answer 1 (score: 2)
Promoting my comment to an answer:
library(reshape2)
dcast(melt(mydf, id="id"), value + variable ~ id, length)
This assumes the numbers are stored in an id variable. If they are stored as row names instead:
dcast(melt(as.matrix(mydf)), value ~ Var1)
Both give:
value 10357 10431 136542 100130426 100133144 100134869
1 down 0 0 0 0 1 0
2 no_change 3 4 4 5 4 5
3 up 2 1 1 0 0 0