Filtering two-part response data from another data frame and joining the two data frames

Time: 2018-05-09 16:02:23

Tags: r dataframe tidyr data-cleaning

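I have a survey question of the form: "Do you prefer roses or tulips? Imagine that the roses come in colors V1 and V2 and the tulips come in colors V3 and V4."

The actual colors are drawn from the combinations contained in a data frame:

Data frame 1 (df1):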

structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"), class = "factor"), 
V2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue", "violet"), class = "factor"), 
V3 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green", "turqoise"), class = "factor"), 
V4 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black", "yellow"), class = "factor")), .Names = c("V1", 
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -16L
))

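In this data frame (df1), the first two columns (V1 and V2) correspond to "rose" and the last two columns (V3 and V4) correspond to "tulip". For example, a respondent may be shown combination 1, the first row of df1, which is "red blue green yellow". This means the respondent can choose between a red and blue rose or a green and yellow tulip.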
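To make the row layout concrete, here is a minimal sketch that rebuilds a data frame equivalent to df1 and prints the two options for a given combination. It assumes plain character columns instead of the factors in the original dput output, and the helper name `describe_combo` is hypothetical.

```r
# Rebuild df1 with character columns (assumption: the original uses factors).
df1 <- data.frame(
  V1 = rep(c("red", "ruby"), each = 8),
  V2 = rep(rep(c("blue", "violet"), each = 4), times = 2),
  V3 = rep(rep(c("green", "turqoise"), each = 2), times = 4),
  V4 = rep(c("yellow", "black"), times = 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the two options offered in combination i (row i of df1).
describe_combo <- function(i) {
  c(rose  = paste(df1$V1[i], df1$V2[i]),
    tulip = paste(df1$V3[i], df1$V4[i]))
}

describe_combo(1)
describe_combo(2)
```

Combinations 1 and 2 offer the same rose and differ only in the second tulip color.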

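The choices made by respondents are stored in a separate data frame (df2), which has one column per color combination. If respondent 1 is shown the first combination from df1 ("red blue green yellow") and chooses the tulip (green and yellow), the choice is recorded as "2" (for tulip, the second flower) in the first row of df2. If respondent 2 is shown the second combination from df1 ("red blue green black") and chooses the rose (red and blue), the choice is recorded as "1" (for rose, the first flower) in the second row of df2. In other words, "2" means the tulip was chosen and the rose was not, and "1" means the rose was chosen and the tulip was not.

Data frame 2 (df2):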
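The encoding can be illustrated with a small base R sketch. The data frame `df2_toy` is a hypothetical two-row stand-in that follows the two examples in the paragraph above rather than the full 16-row data, and the names `combo_cols`, `hits`, and `decoded` are likewise illustrative.

```r
# Toy stand-in for df2: one row per respondent, one column per combination.
df2_toy <- data.frame(
  respondentID = 1:2,
  v1 = c(2L, NA),  # respondent 1 saw combination 1 and chose the tulip (2)
  v2 = c(NA, 1L)   # respondent 2 saw combination 2 and chose the rose (1)
)

combo_cols <- setdiff(names(df2_toy), "respondentID")
answers <- as.matrix(df2_toy[combo_cols])

# Locate the single non-missing cell in each row.
hits <- which(!is.na(answers), arr.ind = TRUE)
hits <- hits[order(hits[, "row"]), , drop = FALSE]

decoded <- data.frame(
  respondentID = df2_toy$respondentID[hits[, "row"]],
  combo        = combo_cols[hits[, "col"]],
  chosen       = c("rose", "tulip")[answers[hits]],
  stringsAsFactors = FALSE
)
decoded
```

The column name tells you which row of df1 was shown, and the cell value tells you which of the two flowers was picked.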

structure(list(respondentID = 1:16, v1 = c(2L, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v2 = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v3 = c(NA, 
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    v4 = c(NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA, 1L, 2L, NA, 
    NA, NA, NA), v5 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA), v6 = c(NA, 2L, NA, NA, NA, NA, NA,
    NA, NA, 1L, NA, NA, NA, NA, NA, NA), v7 = c(NA, NA, NA, NA, 
    1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v8 = c(NA, 
    NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ), v9 = c(NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 
    NA, NA, NA, NA), v10 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), v11 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA), v12 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA
    ), v13 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1L, NA, NA), v14 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), v15 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v16 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L
    )), .Names = c("respondentID", "v1", "v2", "v3", "v4", "v5",
"v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", 
"v16"), class = "data.frame", row.names = c(NA, -16L))

If I only wanted to know which flower and colors were chosen, I could use:

library(dplyr)
library(tidyr)

# Label each row of df1 with the name of the matching column in df2
df1_with_id <- df1 %>% 
  setNames(paste0("color", 1:4)) %>%
  mutate(combo = paste0("v", row_number()))

result_df <- df2 %>%
  gather(key = combo, value = val, -respondentID) %>%
  filter(!is.na(val)) %>%
  left_join(df1_with_id, by = "combo") %>%
  arrange(respondentID)

(As per this question)

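But this does not give me the format I need. I need the information about both options shown to each respondent (i.e. "rose with V1 and V2" and "tulip with V3 and V4") on separate rows, plus an additional variable indicating which of the two options was chosen, as shown here: Desired result

(In the image, "1" in the choice variable marks the option the respondent chose and "0" marks the option that was not chosen.)

I can't figure out how to write code that organizes the data this way. Any suggestions?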

1 answer:

Answer 0 (score: 1)

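The main issue here is that each column in df1 carries two pieces of information: the flower type and the color number. So rename the columns to include both pieces, gather them into one column, separate the key column into flower and color columns, and then spread the color column. After that, it is just a matter of converting val to 1 if it matches the flower column and to 0 otherwise.

df2 %>%
  gather(key = combo, value = val, -respondentID) %>%
  filter(!is.na(val)) %>%
  left_join(df1_with_id, by = "combo") %>%
  arrange(respondentID) %>%
  # encode flower type and color number in each column name
  rename(rose_color1 = color1, rose_color2 = color2,
         tulip_color1 = color3, tulip_color2 = color4) %>%
  gather(color, value, rose_color1:tulip_color2) %>%
  separate(color, into = c('flower', 'color')) %>%
  spread(color, value) %>%
  # turn the 1/2 answer into a flower name, then into a 1/0 indicator
  mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
  mutate(val = if_else(val == flower, 1, 0)) %>%
  select(respondentID, flower, color1, color2, choice = val)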
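For comparison, here is a minimal base R sketch of the same idea, not the answer's own method: build one block of rows for the rose option and one for the tulip option, then stack them. It assumes df1 rebuilt with character columns, and it uses a hypothetical `answers` data frame holding the first four respondents of df2 already in long form (combination shown and option picked).

```r
# df1 rebuilt with character columns (assumption: the original uses factors).
df1 <- data.frame(
  V1 = rep(c("red", "ruby"), each = 8),
  V2 = rep(rep(c("blue", "violet"), each = 4), times = 2),
  V3 = rep(rep(c("green", "turqoise"), each = 2), times = 4),
  V4 = rep(c("yellow", "black"), times = 8),
  stringsAsFactors = FALSE
)

# First four respondents of df2 in long form: respondent 1 answered 2 in v1,
# respondent 2 answered 2 in v6, respondent 3 answered 1 in v5, and so on.
answers <- data.frame(
  respondentID = 1:4,
  combo = c(1L, 6L, 5L, 4L),
  val   = c(2L, 2L, 1L, 2L)
)

shown <- df1[answers$combo, ]  # the combination each respondent saw

rose <- data.frame(
  respondentID = answers$respondentID, flower = "rose",
  color1 = shown$V1, color2 = shown$V2,
  choice = as.integer(answers$val == 1L),
  stringsAsFactors = FALSE
)
tulip <- data.frame(
  respondentID = answers$respondentID, flower = "tulip",
  color1 = shown$V3, color2 = shown$V4,
  choice = as.integer(answers$val == 2L),
  stringsAsFactors = FALSE
)

# Two rows per respondent: one per option, with choice = 1 on the picked one.
result <- rbind(rose, tulip)
result <- result[order(result$respondentID, result$flower), ]
rownames(result) <- NULL
result
```

Each respondent ends up with exactly one row where choice is 1, which is the shape the question asks for.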
