Rearranging information from a data frame in R

Time: 2015-04-28 16:12:24

Tags: r reshape

I have the following df, which was obtained from an Excel file:

df1 <- data.frame( Colour = c("Green","Red","Blue"), 
                   Code = c("N","U", "U"), 
                   User1 = c("John","Brad","Peter"), 
                   User2 = c("Meg","Meg","John"), 
                   User3= c("", "Lucy", ""))

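I need to rearrange it to get a data frame in which all the names are listed in the first column (only once) and the colours (with their corresponding codes) appear in the following columns, like this: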

df2 <- data.frame(User=c("John","Brad","Peter","Meg","Lucy"),
                  Color1 = c("Green","Red","Blue","Green","Red"),
                  Code1 = c("N","U","U","N","U"), 
                  Color2=c("Blue","","","Red",""),
                  Code2=c("U","","","U",""))

I would appreciate some help. Many thanks,

3 answers:

Answer 0 (score: 4)

I was hesitant to post this because of the conceptual similarity to @akrun's answer, but you can also do this with merged.stack from my "splitstackshape" package, together with reshape from base R.

library(splitstackshape)
reshape(
  getanID(
    merged.stack(df1, var.stubs = "User", sep = "var.stubs")[User != ""], 
    "User"), 
  direction = "wide", idvar = "User", timevar = ".id", drop = ".time_1")
#     User Colour.1 Code.1 Colour.2 Code.2
# 1: Peter     Blue      U       NA     NA
# 2:  John     Blue      U    Green      N
# 3:   Meg    Green      N      Red      U
# 4:  Brad      Red      U       NA     NA
# 5:  Lucy      Red      U       NA     NA

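merged.stack makes the data long, getanID creates an ID variable to use when converting to the wide format, and reshape does the actual transformation from this semi-wide form to the wide form.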

Here is the best I could come up with for "dplyr" + "tidyr" users. It looks rather verbose, but it shouldn't be too hard to follow:

library(dplyr)
library(tidyr)

df1 %>%
  gather(var, User, User1:User3) %>%      # Get the data into a long form
  filter(User != "") %>%                  # Drop empty rows
  select(-var) %>%                        # Drop the source column name so each user gets one row
  group_by(User) %>%                      # Group by User
  mutate(Id = sequence(n())) %>%          # Create a new id variable
  gather(var2, value, Colour, Code) %>%   # Go long a second time
  unite(Key, var2, Id) %>%                # Combine values to create a key
  spread(Key, value, fill = "")           # Convert back to a wide form
# Source: local data frame [5 x 5]
# 
#    User Code_1 Code_2 Colour_1 Colour_2
# 1  Brad      U             Red         
# 2  John      N      U    Green     Blue
# 3  Lucy      U             Red         
# 4   Meg      N      U    Green      Red
# 5 Peter      U            Blue         
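
For readers on a recent tidyr, where gather() and spread() are superseded by pivot_longer() and pivot_wider(), the same long-then-wide idea might look like the sketch below. It is only an illustration, not part of the original answer: it assumes tidyr 1.0.0 or later, rebuilds df1 with stringsAsFactors = FALSE, and the names `var`, `Id` and `wide` are placeholders.

```r
library(dplyr)
library(tidyr)

# Input data from the question, kept as character columns
df1 <- data.frame(Colour = c("Green", "Red", "Blue"),
                  Code   = c("N", "U", "U"),
                  User1  = c("John", "Brad", "Peter"),
                  User2  = c("Meg", "Meg", "John"),
                  User3  = c("", "Lucy", ""),
                  stringsAsFactors = FALSE)

wide <- df1 %>%
  pivot_longer(starts_with("User"),
               names_to = "var", values_to = "User") %>%  # one row per (colour, user) pair
  filter(User != "") %>%                                  # drop the empty placeholders
  select(-var) %>%                                        # the source column name is not needed
  group_by(User) %>%
  mutate(Id = row_number()) %>%                           # 1, 2, ... within each user
  ungroup() %>%
  pivot_wider(names_from = Id,
              values_from = c(Colour, Code),
              values_fill = "")                           # empty string instead of NA

wide
```

Each user should end up on a single row, with the second colour/code pair left as an empty string for users who appear only once.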

Answer 1 (score: 4)

It isn't pretty, but here is another solution in pure base R, which uses a couple of calls to reshape():

reshape(
  transform(
    subset(reshape(df1, varying = grep('^User', names(df1)), dir = 'l', v.names = 'User'),
           User != ''),
    id = NULL, time = ave(c(User), User, FUN = seq_along), User = factor(User)),
  dir = 'w', idvar = 'User', sep = '')
##      User Colour1 Code1 Colour2 Code2
## 1.1  John   Green     N    Blue     U
## 2.1  Brad     Red     U    <NA>  <NA>
## 3.1 Peter    Blue     U    <NA>  <NA>
## 1.2   Meg   Green     N     Red     U
## 2.3  Lucy     Red     U    <NA>  <NA>
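
The nested call above is easier to follow when it is unpacked into separate steps. The following is a rough sketch of that unpacking rather than the answerer's exact code: it assumes df1 is built with stringsAsFactors = FALSE, and the intermediate names `long` and `wide` are made up for illustration.

```r
# Input data from the question, kept as character columns
df1 <- data.frame(Colour = c("Green", "Red", "Blue"),
                  Code   = c("N", "U", "U"),
                  User1  = c("John", "Brad", "Peter"),
                  User2  = c("Meg", "Meg", "John"),
                  User3  = c("", "Lucy", ""),
                  stringsAsFactors = FALSE)

# Step 1: stack User1..User3 into a single "User" column (long form)
long <- reshape(df1,
                varying   = grep("^User", names(df1)),
                v.names   = "User",
                direction = "long")

# Step 2: drop the empty placeholders and the "time"/"id" columns reshape() added
long <- subset(long, User != "", select = c(Colour, Code, User))

# Step 3: number each user's rows 1, 2, ... in order of appearance
long$time <- ave(seq_along(long$User), long$User, FUN = seq_along)

# Step 4: go back to wide form, one row per user
wide <- reshape(long, direction = "wide",
                idvar = "User", timevar = "time", sep = "")
rownames(wide) <- NULL

wide
```

The users appear in order of first occurrence, and anyone with only one colour gets NA in the Colour2 and Code2 columns.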

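Answer 2 (score: 3)

We can use dcast from the development version of data.table, i.e. v1.9.5+, which can take multiple value.var columns. We convert the data.frame to a data.table (setDT(df1)), melt the data with 'Colour' and 'Code' as the id columns, keep only the rows where 'User' is not equal to '' ([User!='']), create a sequence grouped by the 'User' column, and then dcast. The installation instructions are here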

library(data.table)#v1.9.5+
dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
           value.name='User')[User!=''][,
              N:=1:.N, User], User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA

Or, as @Arun mentioned in the comments, we can use the subset argument of dcast instead of [User!='']

dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
             value.name='User')[,N:= 1:.N, User],
       subset=.(User !=''), User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA
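
To see what each stage of the chain produces, the steps can also be run one at a time. This is a sketch under a few assumptions: a released data.table that supports multiple value.var columns in dcast, a fresh data.table `dt` built directly instead of calling setDT(df1), and the intermediate names `long` and `wide`, which are made up here. The exact column names of the result may differ from the output shown above depending on the data.table version.

```r
library(data.table)

# Input data from the question as a data.table
dt <- data.table(Colour = c("Green", "Red", "Blue"),
                 Code   = c("N", "U", "U"),
                 User1  = c("John", "Brad", "Peter"),
                 User2  = c("Meg", "Meg", "John"),
                 User3  = c("", "Lucy", ""))

# Step 1: melt User1..User3 into one "User" column, keeping Colour and Code as ids
long <- melt(dt, id.vars = c("Colour", "Code"), value.name = "User")

# Step 2: drop the rows that came from empty cells
long <- long[User != ""]

# Step 3: number each user's rows 1, 2, ... (modifies long by reference)
long[, N := seq_len(.N), by = User]

# Step 4: cast to wide form, spreading both Colour and Code over N
wide <- dcast(long, User ~ N, value.var = c("Colour", "Code"))

wide
```

Printing `long` after step 3 shows the long table with its per-user counter, which is the piece that decides which colour lands in the first or second column pair.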