How to gather/reshape an incomplete data frame across multiple columns

Time: 2019-09-08 07:54:32

Tags: r dataframe dplyr reshape tidyr

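I have a large data frame that contains 3 incomplete blocks of data. I would like to use R to convert the data frame from wide format to long format.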

Example

df <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA, 
    1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))

The data currently looks like this

1234    text1   constant1   VariableA1  VariableA4      NA          NA         NA          NA          NA
1235    text2   constant1   VariableA1  VariableA5      NA          NA         NA          NA          NA
1236    text3   constant1   VariableA2  VariableA4      NA          NA      constant3   VariableC1  VariableC3
1237    text4   constant1   VariableA3  VariableA5  constant2   VariableB1     NA          NA          NA
1238    text5       NA          NA          NA      constant2   VariableB2     NA          NA          NA
1239    text6       NA          NA          NA              NA          NA  constant3   VariableC1  VariableC4
1240    text7       NA          NA          NA              NA          NA  constant3   VariableC2  VariableC3

What I want is

1234    text1   constant1   VariableA1  VariableA4
1235    text2   constant1   VariableA1  VariableA5
1236    text3   constant1   VariableA2  VariableA4
1236    text3   constant3   VariableC1  VariableC3
1237    text4   constant1   VariableA3  VariableA5
1237    text4   constant2   VariableB1  NA
1238    text5   constant2   VariableB2  NA
1239    text6   constant3   VariableC1  VariableC4
1240    text7   constant3   VariableC2  VariableC3

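In the real data, the values in columns 1 and 2 are not as consistent as the ones shown here. In columns 3 to 10, the constants and variables can each take 1-3 different character values.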

This is the closest potential answer I can find so far

1 answer:

Answer 0: (score: 0)

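For some reason I'm not sure what the goal is. My output doesn't exactly match yours, but I think it may contain the tools you need. Happy to edit, but I can't see why the two variable columns are kept separate in the final output.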

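I used tidyr::gather to collect all the "variables" into one column and all the "constants" into a second one. I then used dplyr::group_by and dplyr::summarize to keep the unique combinations of variables, though I bet there is a more elegant way to do that in the tidyverse. The same goes for how I dropped the NA rows (which you may or may not want to do; that wasn't clear to me either).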
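To make the gather step concrete, here is a minimal sketch on a toy data frame. The frame `toy` and its columns `id`, `a`, `b` are made up for illustration and are not part of the question's data; `na.rm = TRUE` is shown only as one possible way to drop the NA rows at this stage.

```r
library(tidyr)

# Hypothetical toy data: two value columns, one of them with a missing entry.
toy <- data.frame(id = 1:2,
                  a = c("x", NA),
                  b = c("y", "z"),
                  stringsAsFactors = FALSE)

# Stack columns a and b into key/value pairs: one row per (id, original column).
long_all <- gather(toy, key = "origcol", value = "vari", a, b)

# Same reshape, but rows whose value is NA are dropped during gathering.
long_no_na <- gather(toy, key = "origcol", value = "vari", a, b, na.rm = TRUE)
```

`long_all` keeps every (id, column) pair, including the NA one, while `long_no_na` keeps only the pairs that actually hold a value.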

f <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA, 1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))

library(tidyverse)

g <- f %>%
  # get all the things called "variables" into one column
  # (note: position 6 is V6, which holds "constant2" in the sample data)
  gather(origcol, vari, 4, 5, 6, 9, 10) %>%
  # the things called "constants"
  gather(oc2, constant, V3, V8) %>%
  # filter out rows where both variable and constant are NA
  filter(!(is.na(vari) & is.na(constant))) %>%
  select(-c(origcol, oc2)) %>%
  # might be a more elegant way to do this
  group_by(V1, V2, vari, constant) %>%
  summarize(dummycol = n()) %>%
  select(-dummycol)

# there is probably an idiomatic tidyverse way to drop rows with NA,
# which you might not want to do quite like this anyway
h <- g[complete.cases(g), ]
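If the two variable columns really should stay separate, as in the desired output in the question, one possible alternative is to treat each block as "one constant column followed by up to two variable columns" and stack the blocks on top of each other. The sketch below uses only base R. The block layout in `blocks`, the helper `stack_block`, and the output column names `constant`, `var1`, `var2` are my own assumptions, and the sample data is rebuilt as plain character columns rather than factors.

```r
# Sample data from the question, rebuilt with character columns (assumption).
df <- data.frame(
  V1  = 1234:1240,
  V2  = paste0("text", 1:7),
  V3  = c(rep("constant1", 4), NA, NA, NA),
  V4  = c("VariableA1", "VariableA1", "VariableA2", "VariableA3", NA, NA, NA),
  V5  = c("VariableA4", "VariableA5", "VariableA4", "VariableA5", NA, NA, NA),
  V6  = c(NA, NA, NA, "constant2", "constant2", NA, NA),
  V7  = c(NA, NA, NA, "VariableB1", "VariableB2", NA, NA),
  V8  = c(NA, NA, "constant3", NA, NA, "constant3", "constant3"),
  V9  = c(NA, NA, "VariableC1", NA, NA, "VariableC1", "VariableC2"),
  V10 = c(NA, NA, "VariableC3", NA, NA, "VariableC4", "VariableC3"),
  stringsAsFactors = FALSE
)

# Assumed layout: constant column first, then up to two variable columns.
# The second block has only one variable column, so its third slot is NA.
blocks <- list(
  c("V3", "V4", "V5"),
  c("V6", "V7", NA),
  c("V8", "V9", "V10")
)

# Hypothetical helper: pull one block into a common five-column shape.
stack_block <- function(cols) {
  pick <- function(col) {
    if (is.na(col)) rep(NA_character_, nrow(df)) else as.character(df[[col]])
  }
  data.frame(V1 = df$V1,
             V2 = as.character(df$V2),
             constant = pick(cols[1]),
             var1 = pick(cols[2]),
             var2 = pick(cols[3]),
             stringsAsFactors = FALSE)
}

# Stack the blocks, drop rows where the block was empty, and sort by id.
long <- do.call(rbind, lapply(blocks, stack_block))
long <- long[!is.na(long$constant), ]
long <- long[order(long$V1, long$constant), ]
rownames(long) <- NULL
```

On the sample data this leaves nine rows, with ids 1236 and 1237 each appearing twice. Rows are kept or dropped based on the constant column alone, so a block with a constant but no variables would still be kept; adjust the filter if that is not what you want.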