My data frame looks like this:
Name Value1 Value2 Value3
sample1 ttn mth lik
sample2 bae ttn.1 apk
sample3 pas kasd mth
dat <- structure(list(Name = c("sample1", "sample2", "sample3"), Value1 = c("ttn",
"bae", "pas"), Value2 = c("mth", "ttn.1", "kasd"), Value3 = c("lik",
"apk", "mth")), .Names = c("Name", "Value1", "Value2", "Value3"
), row.names = c(NA, -3L), class = "data.frame")
I want to rearrange it and count frequencies, so that the result looks like this:
Value Source1 Source2
ttn sample1
mth sample1 sample3
lik sample1
How can I do this?
Answer 0: (score: 2)
Evidently you will end up with a ragged array here, so how about this:
sapply(unique(unlist(dat[-1])), function(x) dat[apply(dat[-1],1,function(y) x%in%y),1])
$ttn
[1] "sample1"
$bae
[1] "sample2"
$pas
[1] "sample3"
$mth
[1] "sample1" "sample3"
$ttn.1
[1] "sample2"
$kasd
[1] "sample3"
$lik
[1] "sample1"
$apk
[1] "sample2"
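The list above is ragged, while the question asks for a rectangular Value / Source1 / Source2 layout. The following is a minimal sketch of one way to get there, assuming it is acceptable to pad the shorter entries with NA. The names vals, src, padded and res are introduced here for illustration only and are not part of the original answer.

```r
dat <- data.frame(
  Name   = c("sample1", "sample2", "sample3"),
  Value1 = c("ttn", "bae", "pas"),
  Value2 = c("mth", "ttn.1", "kasd"),
  Value3 = c("lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# Same lookup as above, but with lapply() so the result is always a list
vals <- unique(unlist(dat[-1]))
src <- lapply(vals, function(x) dat$Name[apply(dat[-1], 1, function(y) x %in% y)])
names(src) <- vals

# Pad every element with NA up to the longest length, then bind the rows
n <- max(lengths(src))
padded <- do.call(rbind, lapply(src, function(s) c(s, rep(NA_character_, n - length(s)))))
colnames(padded) <- paste0("Source", seq_len(n))

res <- data.frame(Value = names(src), padded, stringsAsFactors = FALSE)
rownames(res) <- NULL
res
```

Values that occur in only one sample get NA in Source2. The number of Source columns follows the most widely shared value, so it grows automatically with the data.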
Answer 1: (score: 1)
These solutions won't get you exactly where you want to go, but they may be close enough for your purposes.
First, some data:
temp <- structure(list(Name = c("sample1", "sample2", "sample3"),
Value1 = c("ttn", "bae", "pas"),
Value2 = c("mth", "ttn.1", "kasd"),
Value3 = c("lik", "apk", "mth")),
.Names = c("Name", "Value1", "Value2", "Value3"),
class = "data.frame", row.names = c(NA, -3L))
temp
# Name Value1 Value2 Value3
# 1 sample1 ttn mth lik
# 2 sample2 bae ttn.1 apk
# 3 sample3 pas kasd mth
These data are in "wide" form. Use reshape()
to convert them to "long" form.
temp1 <- reshape(temp, direction = "long",
idvar = "Name", varying = 2:4, sep = "")
temp1
# Name time Value
# sample1.1 sample1 1 ttn
# sample2.1 sample2 1 bae
# sample3.1 sample3 1 pas
# sample1.2 sample1 2 mth
# sample2.2 sample2 2 ttn.1
# sample3.2 sample3 2 kasd
# sample1.3 sample1 3 lik
# sample2.3 sample2 3 apk
# sample3.3 sample3 3 mth
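To make it clearer what the long form contains, here is a rough sketch that builds the same three columns by hand with rep() and unlist() instead of reshape(). This is a swapped-in alternative, not the method used in the answer, and the name long is hypothetical. The row names will differ from those of temp1.

```r
temp <- data.frame(
  Name   = c("sample1", "sample2", "sample3"),
  Value1 = c("ttn", "bae", "pas"),
  Value2 = c("mth", "ttn.1", "kasd"),
  Value3 = c("lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

k <- ncol(temp) - 1  # number of Value columns

long <- data.frame(
  Name  = rep(temp$Name, times = k),             # names repeat once per Value column
  time  = rep(seq_len(k), each = nrow(temp)),    # index of the originating column
  Value = unlist(temp[-1], use.names = FALSE),   # columns stacked one after another
  stringsAsFactors = FALSE
)
long
```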
Now, use aggregate() from base R
or dcast() from the "reshape2" package
to aggregate by the values of "Value".
aggregate(Name ~ Value, temp1, c)
# Value Name
# 1 apk sample2
# 2 bae sample2
# 3 kasd sample3
# 4 lik sample1
# 5 mth sample1, sample3
# 6 pas sample3
# 7 ttn sample1
# 8 ttn.1 sample2
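With c as the aggregating function, the Name column ends up holding vectors of different lengths. If plain strings are easier to work with, one possible variation is to pass paste with a collapse argument instead. The sketch below rebuilds the long data by hand so that it runs on its own; the name agg is hypothetical.

```r
# Long-form data equivalent to temp1 above (row names omitted)
temp1 <- data.frame(
  Name  = rep(c("sample1", "sample2", "sample3"), times = 3),
  Value = c("ttn", "bae", "pas", "mth", "ttn.1", "kasd", "lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

# Extra arguments after FUN are forwarded to it, so collapse goes to paste()
agg <- aggregate(Name ~ Value, temp1, paste, collapse = ", ")
agg
```

Here Name is an ordinary character column, which is convenient for printing or writing to a file.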
require(reshape2)
dcast(temp1, Value ~ Name, value.var = "Value")
# Value sample1 sample2 sample3
# 1 apk <NA> apk <NA>
# 2 bae <NA> bae <NA>
# 3 kasd <NA> <NA> kasd
# 4 lik lik <NA> <NA>
# 5 mth mth <NA> mth
# 6 pas <NA> <NA> pas
# 7 ttn ttn <NA> <NA>
# 8 ttn.1 <NA> ttn.1 <NA>
You also mentioned that you want to "count frequencies", in which case table()
may also be appropriate:
table(temp1$Value, temp1$Name)
#
# sample1 sample2 sample3
# apk 0 1 0
# bae 0 1 0
# kasd 0 0 1
# lik 1 0 0
# mth 1 0 1
# pas 0 0 1
# ttn 1 0 0
# ttn.1 0 1 0
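If "frequency" means the number of samples in which each value appears, the table can be reduced to a single count per value. The following is a small sketch under that assumption; the names tab and freq are introduced here for illustration.

```r
# Long-form data equivalent to temp1 above (row names omitted)
temp1 <- data.frame(
  Name  = rep(c("sample1", "sample2", "sample3"), times = 3),
  Value = c("ttn", "bae", "pas", "mth", "ttn.1", "kasd", "lik", "apk", "mth"),
  stringsAsFactors = FALSE
)

tab <- table(temp1$Value, temp1$Name)

# Count, for each value, how many samples contain it at least once
freq <- rowSums(tab > 0)
freq

# Values shared by more than one sample
freq[freq > 1]
```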