Appending columns and counting frequencies in R

Date: 2012-10-01 09:32:34

Tags: r

My data frame looks like this:

Name       Value1    Value2     Value3
sample1     ttn      mth        lik
sample2     bae      ttn.1      apk
sample3     pas      kasd       mth


dat <- structure(list(Name = c("sample1", "sample2", "sample3"), Value1 = c("ttn", 
"bae", "pas"), Value2 = c("mth", "ttn.1", "kasd"), Value3 = c("lik", 
"apk", "mth")), .Names = c("Name", "Value1", "Value2", "Value3"
), row.names = c(NA, -3L), class = "data.frame")

I want to rearrange it so that each value is listed with the samples it occurs in, like this:

  Value     Source1     Source2
  ttn       sample1
  mth       sample1     sample3
  lik       sample1

How can I do this?

2 answers:

Answer 0 (score: 2)

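You will clearly end up with a ragged array here, since different values occur in different numbers of samples, so how about this: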

sapply(unique(unlist(dat[-1])), function(x) dat[apply(dat[-1], 1, function(y) x %in% y), 1])
$ttn
[1] "sample1"

$bae
[1] "sample2"

$pas
[1] "sample3"

$mth
[1] "sample1" "sample3"

$ttn.1
[1] "sample2"

$kasd
[1] "sample3"

$lik
[1] "sample1"

$apk
[1] "sample2"
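
If a rectangular table closer to the layout in the question is needed, the ragged list can be padded with NA. The following is only a sketch under that assumption, not part of the original answer; the names src, padded and result are hypothetical.

```r
# Same data as in the question.
dat <- data.frame(
  Name   = c("sample1", "sample2", "sample3"),
  Value1 = c("ttn", "bae", "pas"),
  Value2 = c("mth", "ttn.1", "kasd"),
  Value3 = c("lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# Ragged list: for each distinct value, the samples whose row contains it.
vals <- unique(unlist(dat[-1]))
src <- lapply(vals, function(x) {
  dat$Name[apply(dat[-1], 1, function(y) x %in% y)]
})

# Pad every element with NA up to the longest length, then bind as rows.
n_max <- max(lengths(src))
padded <- do.call(rbind, lapply(src, function(s) {
  length(s) <- n_max  # extending a vector fills the new slots with NA
  s
}))
colnames(padded) <- paste0("Source", seq_len(n_max))

# One row per value, one SourceN column per possible occurrence.
result <- data.frame(Value = vals, padded, stringsAsFactors = FALSE)
result
```

With this data the row for mth holds sample1 and sample3, and every other row has NA in Source2.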

Answer 1 (score: 1)

These solutions don't get you exactly where you want to go, but they may be close enough for your purposes.

First, some data:

temp <- structure(list(Name = c("sample1", "sample2", "sample3"), 
                       Value1 = c("ttn", "bae", "pas"), 
                       Value2 = c("mth", "ttn.1", "kasd"), 
                       Value3 = c("lik", "apk", "mth")), 
                  .Names = c("Name", "Value1", "Value2", "Value3"), 
                  class = "data.frame", row.names = c(NA, -3L))
temp
#      Name Value1 Value2 Value3
# 1 sample1    ttn    mth    lik
# 2 sample2    bae  ttn.1    apk
# 3 sample3    pas   kasd    mth

These data are in "wide" form. Use reshape() to convert them to "long" form, with one row per sample/value pair.

temp1 <- reshape(temp, direction = "long", 
                 idvar = "Name", varying = 2:4, sep = "")
temp1
#              Name time Value
# sample1.1 sample1    1   ttn
# sample2.1 sample2    1   bae
# sample3.1 sample3    1   pas
# sample1.2 sample1    2   mth
# sample2.2 sample2    2 ttn.1
# sample3.2 sample3    2  kasd
# sample1.3 sample1    3   lik
# sample2.3 sample2    3   apk
# sample3.3 sample3    3   mth
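
For a table this small, the same long layout can also be built by hand, which may make it clearer what reshape() is doing. This is a sketch rather than the answer's own method; the name long is hypothetical, and the "time" column is omitted because it is not used later.

```r
temp <- data.frame(
  Name   = c("sample1", "sample2", "sample3"),
  Value1 = c("ttn", "bae", "pas"),
  Value2 = c("mth", "ttn.1", "kasd"),
  Value3 = c("lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# unlist() stacks the value columns one after another (column by column),
# so the sample names have to be repeated once per value column.
long <- data.frame(
  Name  = rep(temp$Name, times = ncol(temp) - 1),
  Value = unlist(temp[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```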

Now use aggregate() from base R or dcast() from the "reshape2" package to aggregate by the "Value" column.

aggregate(Name ~ Value, temp1, c)
#   Value             Name
# 1   apk          sample2
# 2   bae          sample2
# 3  kasd          sample3
# 4   lik          sample1
# 5   mth sample1, sample3
# 6   pas          sample3
# 7   ttn          sample1
# 8 ttn.1          sample2
library(reshape2)
dcast(temp1, Value ~ Name, value.var = "Value")
#   Value sample1 sample2 sample3
# 1   apk    <NA>     apk    <NA>
# 2   bae    <NA>     bae    <NA>
# 3  kasd    <NA>    <NA>    kasd
# 4   lik     lik    <NA>    <NA>
# 5   mth     mth    <NA>     mth
# 6   pas    <NA>    <NA>     pas
# 7   ttn     ttn    <NA>    <NA>
# 8 ttn.1    <NA>   ttn.1    <NA>
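
When c is used as the aggregation function, the grouped names can end up stored as a list column, which is awkward to export or filter. A possible alternative, sketched here on the assumption that a plain character column plus a count is acceptable, is to collapse the names with paste(). The names long, collapsed and n_samples are hypothetical.

```r
# Long-form data, equivalent to the Name and Value columns of temp1.
long <- data.frame(
  Name  = rep(c("sample1", "sample2", "sample3"), times = 3),
  Value = c("ttn", "bae", "pas", "mth", "ttn.1", "kasd", "lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# One comma-separated string of sample names per value.
collapsed <- aggregate(Name ~ Value, data = long, FUN = paste, collapse = ", ")

# Number of samples per value, matched back by Value rather than by position.
n_samples <- aggregate(Name ~ Value, data = long, FUN = length)
collapsed$Count <- n_samples$Name[match(collapsed$Value, n_samples$Value)]
collapsed
```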

You also mentioned that you want to "count frequencies", in which case table() may also be appropriate:

table(temp1$Value, temp1$Name)
# 
#       sample1 sample2 sample3
# apk         0       1       0
# bae         0       1       0
# kasd        0       0       1
# lik         1       0       0
# mth         1       0       1
# pas         0       0       1
# ttn         1       0       0
# ttn.1       0       1       0
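
If an overall frequency per value is wanted rather than the per-sample breakdown, the rows of that table can be summed. A minimal sketch, with hypothetical names long, tab, freq and shared:

```r
# Long-form data, equivalent to the Name and Value columns of temp1.
long <- data.frame(
  Name  = rep(c("sample1", "sample2", "sample3"), times = 3),
  Value = c("ttn", "bae", "pas", "mth", "ttn.1", "kasd", "lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# Value-by-sample contingency table, as above.
tab <- table(long$Value, long$Name)

# Total number of occurrences of each value across all samples.
freq <- rowSums(tab)

# Values that occur more than once.
shared <- names(freq)[freq > 1]
shared
```

Here only mth occurs more than once, in sample1 and sample3.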