I have a data frame:
source= c("A", "A", "B")
target = c("B", "C", "C")
source_A = c(5, 5, 6)
target_A = c(6, 7, 7)
source_B = c(10, 10, 11)
target_B = c(11, 12, 12)
c = c(0.5, 0.6, 0.7)
df = data.frame(source, target, source_A, target_A, source_B, target_B, c)
> df
  source target source_A target_A source_B target_B   c
1      A      B        5        6       10       11 0.5
2      A      C        5        7       10       12 0.6
3      B      C        6        7       11       12 0.7
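How can I reduce this data frame so that it returns one row per unique source/target value, together with that value's A and B (ignoring column c)?
For the values [A B C], the expected result is:
  id A  B
1  A 5 10
2  B 6 11
3  C 7 12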
At the moment I am doing something like this:
df1 <- df[,c("source","source_A", "source_B")]
df2 <- df[,c("target","target_A", "target_B")]
names(df1)[names(df1) == 'source'] <- 'id'
names(df1)[names(df1) == 'source_A'] <- 'A'
names(df1)[names(df1) == 'source_B'] <- 'B'
names(df2)[names(df2) == 'target'] <- 'id'
names(df2)[names(df2) == 'target_A'] <- 'A'
names(df2)[names(df2) == 'target_B'] <- 'B'
df3 <- rbind(df1,df2)
df3[!duplicated(df3$id),]
  id A  B
1  A 5 10
3  B 6 11
5  C 7 12
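In reality I have dozens of columns, so this approach is not feasible in the long run.
How can I do this more concisely (ideally in a way that generalizes to more columns)?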
Answer 0 (score: 0)
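library(dplyr)
library(magrittr)
# Pick columns by name pattern. ls() lists objects in the workspace, not the
# columns of df, so match against names(df) instead.
df1 <- df[grep("^source", names(df))]
df2 <- df[grep("^target", names(df))]
# Give both halves the same column names so they can be stacked
names(df1) <- names(df2)
df <- bind_rows(df1, df2)
# Keep one row per unique combination of values
df %<>% group_by(target, target_A, target_B) %>% slice(1)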
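This should do it, though I'm not quite sure how you want to generalize it. I don't think it's the most elegant solution in the world, but it serves the purpose. Hopefully the columns you intend to use can be targeted by a string pattern in their names!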
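One way the name-pattern idea might generalize to more columns, as a rough base R sketch. `stack_by_prefix` is a hypothetical helper (it is not from any package), and it assumes every column to keep is named either exactly the prefix (`source`) or `prefix_suffix` (`source_A`); anything else, such as `c`, is ignored.

```r
# Example data from the question
df <- data.frame(
  source   = c("A", "A", "B"),
  target   = c("B", "C", "C"),
  source_A = c(5, 5, 6),
  target_A = c(6, 7, 7),
  source_B = c(10, 10, 11),
  target_B = c(11, 12, 12),
  c        = c(0.5, 0.6, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each prefix, take the matching columns, strip the
# prefix from their names, stack the pieces, and drop duplicated rows.
stack_by_prefix <- function(data, prefixes = c("source", "target")) {
  parts <- lapply(prefixes, function(p) {
    # Columns named exactly `p` or starting with `p_`
    cols <- grep(paste0("^", p, "(_|$)"), names(data), value = TRUE)
    part <- data[cols]
    # "source" -> "id", "source_A" -> "A"
    names(part) <- ifelse(cols == p, "id", sub(paste0("^", p, "_"), "", cols))
    part
  })
  out <- unique(do.call(rbind, parts))
  rownames(out) <- NULL
  out
}

result <- stack_by_prefix(df)
print(result)
```

Note that `unique()` compares whole rows, whereas the question's `!duplicated(df3$id)` keeps the first row per id. The two agree only if each id always carries the same A and B values, as in the example data.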
Answer 1 (score: 0)
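Here is a more general approach using dplyr and tidyr functions. You basically need to gather everything into long format, where you can rename the variables accordingly, and then spread them back out into id, A, B: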
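library(dplyr)
library(tidyr)
df %>%
  select(-c) %>%                                          # drop the column to ignore
  mutate(index = row_number()) %>%                        # remember the original row
  gather(key, value, -index) %>%                          # long format: one row per cell
  separate(key, c("type", "name"), fill = "right") %>%    # "source_A" -> "source", "A"
  mutate(name = ifelse(is.na(name), "id", name)) %>%      # bare "source"/"target" become "id"
  spread(key = name, value = value, convert = TRUE) %>%   # back to wide; convert restores numeric types
  select(id, matches("[A-Z]", ignore.case = FALSE)) %>%   # keep id plus the upper-case columns
  distinct()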
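The same gather-then-spread idea can probably be written in one reshaping step with `pivot_longer()`, assuming tidyr 1.0.0 or later is available. This is only a sketch: the `source_id`/`target_id` names are introduced here so that every column follows the `prefix_suffix` pattern, and the `.value` sentinel tells tidyr to turn each suffix into its own output column.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  source   = c("A", "A", "B"),
  target   = c("B", "C", "C"),
  source_A = c(5, 5, 6),
  target_A = c(6, 7, 7),
  source_B = c(10, 10, 11),
  target_B = c(11, 12, 12),
  c        = c(0.5, 0.6, 0.7),
  stringsAsFactors = FALSE
)

result <- df %>%
  select(-c) %>%
  # Assumed renaming so the id columns also split into prefix + suffix
  rename(source_id = source, target_id = target) %>%
  # "source_A" -> type = "source", value goes into column "A"
  pivot_longer(everything(), names_to = c("type", ".value"), names_sep = "_") %>%
  select(-type) %>%
  distinct() %>%
  arrange(id)

print(result)
```

Because each suffix becomes its own column, A and B keep their numeric type and no conversion back from character is needed.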