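Combining the results of table in R

Time: 2016-02-13 14:29:13

Tags: r

I want to count the occurrences of the three factor levels for each column of mydata, so I thought of the function table.

Some data from mydata: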

              A0AUT     A0AYT     A0AZT     A0B2T     A0B3T
100130426 no_change no_change no_change no_change no_change
100133144 no_change no_change      down no_change no_change
100134869 no_change no_change no_change no_change no_change
10357     no_change        up no_change no_change        up
10431     no_change        up no_change no_change no_change
136542    no_change        up no_change no_change no_change
> str(mydata)
'data.frame':   20531 obs. of  518 variables:
 $ A0AUT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ A0AYT: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 3 3 2 2 2 3 ...
 $ A0AZT: Factor w/ 3 levels "down","no_change",..: 2 1 2 2 2 2 1 2 2 2 ...
 $ A0B2T: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 1 2 2 2 ...
 $ A0B3T: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ A0B5T: Factor w/ 3 levels "down","no_change",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ A0B7T: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 1 2 2 2 ...
 $ A0B8T: Factor w/ 3 levels "down","no_change",..: 2 1 1 2 3 2 2 2 2 2 ...
 $ A0BAT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ A0BCT: Factor w/ 3 levels "down","no_change",..: 2 2 2 2 3 2 2 2 2 2 ...

Now I did:

occurences <- apply(mydata, 1, table)
> occurences[[1]] # 100130426

no_change        up 
      508        10 
> occurences[[2]] # 100133144

     down no_change        up 
       45       446        27 

But I want them as a matrix (or at least I think that would be easier to work with), so I did this:

  freq <- sapply(occurences, function(x){
    c(x, rep(0, 3 - length(x)))
  })

> freq[,1:5]
          100130426 100133144 100134869 10357 10431
no_change       508        45        14     3     3
up               10       446       411   330   268
                  0        27        93   185   247

But as you can see, the no_change count for 100133144 has ended up in the up row!

My expected output is:

> freq[,1:5]
              100130426 100133144 100134869 10357 10431
    up               10        45        14     3     3
    no_change       508       446       411   330   268
    down              0        27        93   185   247

How can I get each value into the right row? As you can see, each table may have anywhere from one to three elements, so doing this:

freq <- matrix(unlist(occurences), nrow=3)

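leads to an error, because the length is not a multiple of 3.

I may have taken a poor approach to computing the frequencies of mydata by column. I would prefer a base R solution, without using any libraries.

2 answers:

Answer 0: (score: 2)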

We can use table. Convert the 'data.frame' to a 'matrix', reshape it from 'wide' to 'long' (using melt from reshape2), and call table on the relevant columns to get the frequency counts.

library(reshape2)
table(melt(as.matrix(mydata))[c(3,1)])
#              Var1
#value       10357 10431 136542 100130426 100133144 100134869
#  down          0     0      0         0         1         0
#  no_change     3     4      4         5         4         5
#  up            2     1      1         0         0         0

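Or, using only base R, we can simply unlist the data to get a vector, replicate the 'row names' to match it (using row, since unlist walks the data column by column), and then call table:

table(unlist(mydata), row.names(mydata)[row(mydata)])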
#             Var1
#value       10357 10431 136542 100130426 100133144 100134869
#  down          0     0      0         0         1         0
#  no_change     3     4      4         5         4         5
#  up            2     1      1         0         0         0
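
To make the base R approach easy to try, here is a minimal self-contained sketch. It rebuilds only the six sample rows printed in the question (an assumption: the real mydata has 20531 rows and 518 columns), and the names lev and freq are introduced here for illustration.

```r
# Rebuild the six sample rows shown in the question. The columns are
# factors sharing the same three levels, as in the str(mydata) output.
lev <- c("down", "no_change", "up")
mydata <- data.frame(
  A0AUT = rep("no_change", 6),
  A0AYT = c("no_change", "no_change", "no_change", "up", "up", "up"),
  A0AZT = c("no_change", "down", "no_change", "no_change", "no_change", "no_change"),
  A0B2T = rep("no_change", 6),
  A0B3T = c("no_change", "no_change", "no_change", "up", "no_change", "no_change"),
  row.names = c("100130426", "100133144", "100134869", "10357", "10431", "136542"),
  stringsAsFactors = FALSE
)
mydata[] <- lapply(mydata, factor, levels = lev)

# unlist() walks the data column by column, and row() yields the row index
# in that same order, so each value is paired with the name of its own row.
freq <- table(unlist(mydata), row.names(mydata)[row(mydata)])
print(freq)
```

Each column of freq should sum to 5, the number of sample columns, and levels that never occur in a row appear as 0 rather than being dropped.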

Another option is dplyr/tidyr:

library(dplyr)
library(tidyr)
add_rownames(mydata) %>%
    gather(Var, Val,-rowname) %>% 
    group_by(rowname, Val) %>%
    summarise(n=n()) %>% 
    spread(rowname, n, fill=0)

Update

If the dataset columns are factors, we can convert them to character before doing the unlist:

mydata[] <- lapply(mydata, as.character)

UPDATE2

If this is to be done on a per-row basis:

library(qdapTools)
t(mtabulate(as.data.frame(t(mydata))))
#          100130426 100133144 100134869 10357 10431 136542
#no_change         5         4         5     3     4      4
#down              0         1         0     0     0      0
#up                0         0         0     2     1      1

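Or, using only base R, we create a vector of the unique elements in the dataset ('nm1'; it is known here, but if it were not, nm1 <- unique(unlist(lapply(mydata, as.character)))), then loop over the rows using apply with MARGIN=1, and use tabulate after converting each row vector to a factor with levels specified as 'nm1'. In tabulate, we can also specify the length of the returned vector, i.e. the length of 'nm1'. The output will be a matrix. We can assign its row names (row.names<-) as 'nm1'.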

nm1 <- c('up', 'no_change', 'down')
`row.names<-`(apply(mydata, 1, function(x)
     tabulate(factor(x, levels=nm1),length(nm1))), nm1)
#          100130426 100133144 100134869 10357 10431 136542
#up                0         0         0     2     1      1
#no_change         5         4         5     3     4      4
#down              0         1         0     0     0      0
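
The padding in the question misaligned the counts because table on a plain character row reports only the values present in that row, so appending zeros at the end shifts the remaining counts up. As a rough sketch of the fixed-levels idea, the following uses a small hypothetical matrix (m, g1..g3 and s1..s3 are made-up names, not from the question) and checks that table on a factor with fixed levels agrees with the tabulate version above.

```r
# Hypothetical three-row sample; apply() works on a character matrix.
m <- matrix(
  c("no_change", "no_change", "no_change",
    "no_change", "down",      "up",
    "up",        "up",        "no_change"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3"))
)
nm1 <- c("up", "no_change", "down")

# table() on a factor with fixed levels reports a zero for absent levels,
# so every row returns a length-3 vector in the same order.
freq <- apply(m, 1, function(x) table(factor(x, levels = nm1)))
print(freq)

# Same counts via tabulate(), with the row names assigned afterwards.
freq2 <- `row.names<-`(
  apply(m, 1, function(x) tabulate(factor(x, levels = nm1), length(nm1))),
  nm1
)
```

Both variants return a 3 x 3 integer matrix with one column per row of m, so no manual padding is needed.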

Answer 1: (score: 2)

Promoting my comment to an answer:

library(reshape2)
dcast(melt(mydf, id="id"), value ~ id, length)

This assumes the numbers are stored in an id variable. If they are stored as row names instead:

dcast(melt(as.matrix(mydf)), value ~ Var1)

Both give:

      value 10357 10431 136542 100130426 100133144 100134869
1      down     0     0      0         0         1         0
2 no_change     3     4      4         5         4         5
3        up     2     1      1         0         0         0