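How do I compile a list of strings by matching multiple columns across data frames?

Time: 2015-02-06 18:27:19

Tags: r

I have a really challenging problem and no idea how to approach it. (I'm not even sure I titled this post correctly.) Anyway, I have two data frames, df1 and df2: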

df1 <- structure(list(country = structure(c(1L, 1L, 2L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), state = structure(1:4, .Label = c("d", 
"m", "o", "q"), class = "factor"), city = structure(1:4, .Label = c("h", 
"n", "p", "r"), class = "factor"), value = c(1L, 3L, 3L, 4L), 
    source = structure(1:4, .Label = c("string1", "string2", 
    "string3", "string4"), class = "factor")), .Names = c("country", 
"state", "city", "value", "source"), class = "data.frame", row.names = c(NA, 
-4L))


df2 <- structure(list(country = structure(c(1L, 1L, 2L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), state = structure(1:4, .Label = c("d", 
"e", "f", "g"), class = "factor"), city = structure(1:4, .Label = c("h", 
"i", "j", "k"), class = "factor"), mean_value = 1:4, level_of_mean = structure(c(1L, 
2L, 2L, 2L), .Label = c("city", "country"), class = "factor")), .Names = c("country", 
"state", "city", "mean_value", "level_of_mean"), class = "data.frame", row.names = c(NA, 
-4L))

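Both data frames contain data for various countries, states, and cities. Data frame df1 contains the "raw" data. df2 contains means calculated from the values in df1 at one of three levels (city, state, or country, in that order of preference), depending on data availability.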

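What I need to do is this: for each mean_value in df2, use the associated level_of_mean together with country, state, and city to look in df1, and build a list of the strings in the source column of the matching rows. For the data frames above, this would give the following result:

source <- structure(1:4, .Label = c("string1", "string2", "string3", "string4" ), class = "factor")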

Does anyone know how to approach this? Frankly, I'm not even sure where to start!

EDIT: I should also note that my "real" data frames contain many different mean_value and level_of_mean columns, so a general solution would be best.

1 answer:

Answer 0: (score: 0)

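library(dplyr)
library(tidyr)

df2 %>%
  select(-mean_value) %>%
  # long format: one row per (level, value) pair, e.g. ("city", "h")
  gather(level, value, -level_of_mean) %>%
  # keep only the rows whose level is the one the mean was computed at
  filter(as.character(level_of_mean) == as.character(level)) %>%
  select(-level_of_mean) %>%
  # reshape df1 the same way, then match on level and value
  inner_join(df1 %>% 
               gather(level, value, -source),
             by = c("level", "value")
             ) %>%
  select(source) %>%
  distinct() %>%
  unlist(use.names = FALSE) %>%
  as.character()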

You will see some warnings about factor levels being inconsistent across country, state, and city. You can safely ignore or suppress them.
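If you would rather avoid the warnings than suppress them, one option is to convert the columns to character before reshaping. The following is a minimal sketch, not part of the original answer. It rebuilds df1 and df2 with data.frame() using the same values as in the question, and the helper to_chr and the intermediate names keys, lookup, and sources are hypothetical names introduced here for illustration. It also drops df1's value column before reshaping, on the assumption that it is never used as a matching level.

```r
library(dplyr)
library(tidyr)

# Stand-ins for the question's df1 and df2 (same values, built with data.frame()).
df1 <- data.frame(
  country = factor(c("a", "a", "b", "c")),
  state   = factor(c("d", "m", "o", "q")),
  city    = factor(c("h", "n", "p", "r")),
  value   = c(1L, 3L, 3L, 4L),
  source  = factor(c("string1", "string2", "string3", "string4"))
)
df2 <- data.frame(
  country       = factor(c("a", "a", "b", "c")),
  state         = factor(c("d", "e", "f", "g")),
  city          = factor(c("h", "i", "j", "k")),
  mean_value    = 1:4,
  level_of_mean = factor(c("city", "country", "country", "country"))
)

# Hypothetical helper: turn every column into character so that gather()
# has no factor levels to reconcile.
to_chr <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# One (level, value) pair per row of df2, taken at the level named in level_of_mean.
keys <- df2 %>%
  to_chr() %>%
  select(-mean_value) %>%
  gather(level, value, -level_of_mean) %>%
  filter(level_of_mean == level) %>%
  select(-level_of_mean)

# df1 in the same long format; the numeric value column is dropped first
# (assumption: it is never a matching level).
lookup <- df1 %>%
  to_chr() %>%
  select(-value) %>%
  gather(level, value, -source)

sources <- keys %>%
  inner_join(lookup, by = c("level", "value")) %>%
  distinct(source) %>%
  pull(source)
```

Here sources is a plain character vector rather than a factor, so wrap it in factor() if you need the exact structure shown in the question.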

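In case you are not familiar with the chaining (pipe) operator %>%: it comes from the magrittr package and is re-exported by dplyr. Basically, x %>% f(y) is the same as f(x, y).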
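A small illustration of that equivalence, using only base R's head() and the pipe from magrittr (loading dplyr would make %>% available as well). The variable names x, piped, and direct are made up for this example.

```r
library(magrittr)

x <- c(1, 4, 9)

piped  <- x %>% head(2)   # the left-hand side becomes the first argument
direct <- head(x, 2)      # the same call written without the pipe

identical(piped, direct)
```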

But you can do the same thing without that operator:

df2 <- select(df2, -mean_value)
df2 <- gather(df2, level, value, -level_of_mean)
df2 <- filter(df2, ...

These functions come from dplyr and tidyr. If you prefer, you can do the same thing with reshape2::melt instead of gather, merge instead of inner_join, and unique instead of distinct.
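For comparison, here is a rough sketch of the same idea using only base R, with merge and unique as suggested above and a small hand-written reshaping step in place of reshape2::melt. It is not from the original answer. The helper to_long, the vector levels_used, and the intermediate names are hypothetical, and the data frames are rebuilt with data.frame() using the question's values.

```r
# Stand-ins for the question's data (same values; characters instead of factors).
df1 <- data.frame(
  country = c("a", "a", "b", "c"),
  state   = c("d", "m", "o", "q"),
  city    = c("h", "n", "p", "r"),
  value   = c(1L, 3L, 3L, 4L),
  source  = c("string1", "string2", "string3", "string4"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  country       = c("a", "a", "b", "c"),
  state         = c("d", "e", "f", "g"),
  city          = c("h", "i", "j", "k"),
  mean_value    = 1:4,
  level_of_mean = c("city", "country", "country", "country"),
  stringsAsFactors = FALSE
)

# Assumption: these are the only columns that can be named in level_of_mean.
levels_used <- c("country", "state", "city")

# Hypothetical helper: stack the level columns into (level, value) rows and
# carry one extra column (level_of_mean or source) along with each row.
to_long <- function(df, extra) {
  do.call(rbind, lapply(levels_used, function(col) {
    data.frame(
      level = col,
      value = as.character(df[[col]]),
      extra = as.character(df[[extra]]),
      stringsAsFactors = FALSE
    )
  }))
}

# Keep the (level, value) pair at the level each mean was computed at.
df2_long <- to_long(df2, "level_of_mean")
keys <- df2_long[df2_long$level == df2_long$extra, c("level", "value")]

# df1 in the same long format, with its source string attached.
df1_long <- to_long(df1, "source")
names(df1_long)[names(df1_long) == "extra"] <- "source"

# merge() plays the role of inner_join(), unique() the role of distinct().
matched <- merge(keys, df1_long, by = c("level", "value"))
sources <- sort(unique(matched$source))
```

Because the level columns are listed in levels_used rather than hard-coded in the matching step, extending this to more levels should only require changing that vector, though that has not been tested against the real data.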