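I have a large data frame that contains three incomplete blocks of data. I want to use R to convert the data frame from wide format to long format.
Example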
df <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA,
1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))
The data currently looks like this
1234 text1 constant1 VariableA1 VariableA4 NA NA NA NA NA
1235 text2 constant1 VariableA1 VariableA5 NA NA NA NA NA
1236 text3 constant1 VariableA2 VariableA4 NA NA constant3 VariableC1 VariableC3
1237 text4 constant1 VariableA3 VariableA5 constant2 VariableB1 NA NA NA
1238 text5 NA NA NA constant2 VariableB2 NA NA NA
1239 text6 NA NA NA NA NA constant3 VariableC1 VariableC4
1240 text7 NA NA NA NA NA constant3 VariableC2 VariableC3
What I would like is
1234 text1 constant1 VariableA1 VariableA4
1235 text2 constant1 VariableA1 VariableA5
1236 text3 constant1 VariableA2 VariableA4
1236 text3 constant3 VariableC1 VariableC3
1237 text4 constant1 VariableA3 VariableA5
1237 text4 constant2 VariableB1 NA
1238 text5 constant2 VariableB2 NA
1239 text6 constant3 VariableC1 VariableC4
1240 text7 constant3 VariableC2 VariableC3
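In the real data, the values in columns 1 and 2 are not as consistent as the ones shown here. In columns 3 through 10, the constants and variables can each take 1-3 different character values.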
Answer 0 (score: 0)
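For some reason, I'm not sure what the goal is. My output doesn't match yours exactly, but I think it may contain the tools you need. Happy to edit, but I can't see why the two variable columns are kept separate in the final output.
I used tidyr::gather to get all the "variables" into one column and all the "constants" into a second. Then I used dplyr::group_by and dplyr::summarize to keep the unique combinations of variables, though I bet there is a more elegant way to do this in the tidyverse. The same goes for how I dropped the NA rows (which you may or may not want to do; that wasn't clear to me either).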
f <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA, 1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))
library(tidyverse)
g <- f %>%
  gather(origcol, vari, 4, 5, 7, 9, 10) %>%      # all the "variable" columns (V4, V5, V7, V9, V10) into one column
  gather(oc2, constant, V3, V6, V8) %>%          # all the "constant" columns into a second column
  filter(!(is.na(vari) & is.na(constant))) %>%   # filter out rows where both variable and constant are NA
  select(-c(origcol, oc2)) %>%
  group_by(V1, V2, vari, constant) %>%
  summarize(dummycol = n()) %>%                  # keeps unique combinations; there might be a more elegant way
  select(-dummycol)
h <- g[complete.cases(g), ]  # drops rows with any NA; there is probably a more idiomatic tidyverse way, and you might not want this anyway
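If the goal is to keep the two variable columns side by side, as in the desired output of the question, one possible approach is to treat each block of columns as its own small data frame and stack the blocks. The following is a minimal sketch, not a definitive solution. It assumes the block boundaries are known in advance (V3:V5, V6:V7, V8:V10); the `blocks` list and the output column names `constant`, `var1`, and `var2` are made up for illustration, and the sample data is rebuilt with character columns instead of factors to keep the example self-contained.

```r
library(dplyr)

# Stand-in for the question's data, with character columns instead of factors
df <- data.frame(
  V1  = 1234:1240,
  V2  = paste0("text", 1:7),
  V3  = c("constant1", "constant1", "constant1", "constant1", NA, NA, NA),
  V4  = c("VariableA1", "VariableA1", "VariableA2", "VariableA3", NA, NA, NA),
  V5  = c("VariableA4", "VariableA5", "VariableA4", "VariableA5", NA, NA, NA),
  V6  = c(NA, NA, NA, "constant2", "constant2", NA, NA),
  V7  = c(NA, NA, NA, "VariableB1", "VariableB2", NA, NA),
  V8  = c(NA, NA, "constant3", NA, NA, "constant3", "constant3"),
  V9  = c(NA, NA, "VariableC1", NA, NA, "VariableC1", "VariableC2"),
  V10 = c(NA, NA, "VariableC3", NA, NA, "VariableC4", "VariableC3"),
  stringsAsFactors = FALSE
)

# Assumed block layout: the first column of each block is the "constant",
# the remaining one or two columns are the "variables"
blocks <- list(
  c("V3", "V4", "V5"),
  c("V6", "V7"),
  c("V8", "V9", "V10")
)
new_names <- c("constant", "var1", "var2")  # hypothetical output names

long <- lapply(blocks, function(cols) {
  block <- df[, c("V1", "V2", cols)]
  # Rename the block's columns so that all blocks line up when stacked
  names(block)[-(1:2)] <- new_names[seq_along(cols)]
  block
}) %>%
  bind_rows() %>%               # columns missing from a block (var2 for V6:V7) are filled with NA
  filter(!is.na(constant)) %>%  # drop rows where the block is empty for that ID
  arrange(V1, constant)

long
```

Because `bind_rows()` matches columns by name, the two-column block simply gets `NA` in `var2`, which mirrors the `NA` shown for constant2 in the desired output. If the real data has factor columns, converting them to character first (for example with `as.character`) avoids level-mismatch surprises when stacking.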