I want to "summarise" a factor variable in R, so that for each record I know which factor levels are present.
Here is a simplified example data frame:
df <- data.frame(record = c("a","a","b","c","c","c"),
                 species = c("COD", "SCE", "COD", "COD","SCE","QSC"),
                 stringsAsFactors = TRUE)  # needed on R >= 4.0 so that species is a factor
record species
a COD
a SCE
b COD
c COD
c SCE
c QSC
This is what I want to achieve:
data.frame(record = c("a","b","c"), species = c("COD, SCE", "COD", "COD, SCE, QSC"))
record species
a COD, SCE
b COD
c COD, SCE, QSC
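This is the closest I have managed to get, but it puts all levels of the factor into every record, rather than only the levels that actually occur in that record.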
summarise(group_by(df, record),
species = (paste(levels(species), collapse="")))
record species
<fctr> <chr>
a CODQSCSCE <- this should be CODSCE
b CODQSCSCE <- this should just be COD
c CODQSCSCE <- this is correct as CODQSCSCE as it has all levels
tapply gives the same problem:
tapply(df$species, df$record, function(x) paste(levels(x), collapse=""))
a b c
"CODQSCSCE" "CODQSCSCE" "CODQSCSCE"
I need a way to find out which species levels actually occur in each record.
Thanks for your help!
Answer 0 (score: 4)
Use unique():
library(dplyr)
df %>%
  group_by(record) %>%
  summarise(species = paste(unique(species), collapse = ', '))
# A tibble: 3 x 2
record species
<fctr> <chr>
1 a COD, SCE
2 b COD
3 c COD, SCE, QSC
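The original attempt repeats every level because levels() reports the full level set attached to the factor, and subsetting by group does not shrink that set; unique() looks only at the values actually present. A minimal base R sketch of the difference, using the question's data (stringsAsFactors = TRUE is an assumption added here so that species is a factor on R 4.0 and later; the names b and res are illustrative):

```r
df <- data.frame(record  = c("a", "a", "b", "c", "c", "c"),
                 species = c("COD", "SCE", "COD", "COD", "SCE", "QSC"),
                 stringsAsFactors = TRUE)

# Species values for record "b" only; the factor still carries all three levels
b <- df$species[df$record == "b"]

levels(b)                # every level of the factor: "COD" "QSC" "SCE"
as.character(unique(b))  # only the value present:    "COD"
levels(droplevels(b))    # unused levels dropped:     "COD"

# The same idea per record, fixing the tapply attempt from the question
res <- tapply(df$species, df$record,
              function(x) paste(unique(x), collapse = ", "))
res
```

droplevels() is shown only as an alternative; unique() is usually enough when the goal is a pasted string.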
Answer 1 (score: 0)
You can use base R aggregate:
aggregate(species ~ record, data = df, paste, collapse = ",")
If you want a dplyr solution:
df %>%
  group_by(record) %>%
  summarise(species = paste(species, collapse = ","))
If you want to use the data.table package (thanks to @PLapointe for setDT):
library(data.table)
setDT(df)[ , .(species = list(species)), by = record]
N.B. If you do not want duplicates, apply df <- unique(df) before using any of the solutions above.
The output will be:
# record species
# 1: a COD,SCE
# 2: b COD
# 3: c COD,SCE,QSC
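To see what the N.B. about duplicates changes, here is a small base R sketch. The data frame df2 is hypothetical: it is the question's data plus one deliberately repeated row (a, COD), which is not in the original post, and stringsAsFactors = TRUE is again an assumption to keep species a factor on R 4.0 and later.

```r
# Question's data with one row (a, COD) repeated on purpose
df2 <- data.frame(record  = c("a", "a", "a", "b", "c", "c", "c"),
                  species = c("COD", "COD", "SCE", "COD", "COD", "SCE", "QSC"),
                  stringsAsFactors = TRUE)

# Without de-duplication the repeated species shows up twice for record "a"
with_dups <- aggregate(species ~ record, data = df2,
                       FUN = paste, collapse = ",")

# unique() on the whole data frame removes the repeated row first
without_dups <- aggregate(species ~ record, data = unique(df2),
                          FUN = paste, collapse = ",")

with_dups$species     # "COD,COD,SCE" "COD" "COD,SCE,QSC"
without_dups$species  # "COD,SCE"     "COD" "COD,SCE,QSC"
```

Passing unique(df2) as the data argument leaves the original data frame untouched, which may be preferable to overwriting it.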