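How to reshape data from wide to long format, specifying the relationships

Asked: 2015-11-09 06:08:56

Tags: r reshape

I currently have a file that I need to convert from wide format to long format. An example of the data: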

Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0

However, I need the long format to look like this:

Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0

So far I have tried using the melt function in R:

datalong <- melt(exp2_simon_shortform, id ="Subject")

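But it treats each column name as a unique variable, each with its own value. Does anyone know how to go from wide to long as specified, making use of the column header names?

Cheers.

Edit: I realized I made a mistake. My final output needs to be as follows. So from the Cat1_ part, I actually need to extract "Cat" and "1":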

Subject Animal  CatNumber   Weight  Sickness
1   Cat 1   10  1
1   Cat 2   11  0
1   Cat 3   12  0
2   Cat 1   7   1
2   Cat 2   8   0
2   Cat 3   9   0

Any updated solutions are much appreciated.

3 answers:

Answer 0 (score: 4)

A "dplyr" + "tidyr" approach might look something like:

library(dplyr)
library(tidyr)
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) 
#   Subject CatNumber Sick Weight
# 1       1      Cat1    1     10
# 2       1      Cat2    0     11
# 3       1      Cat3    0     12
# 4       2      Cat1    1      7
# 5       2      Cat2    0      8
# 6       2      Cat3    0      9

Add a mutate with gsub in there to remove the "Cat" part of the "CatNumber" column.
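As a rough sketch of that suggestion, the pipeline below rebuilds the question's example data and appends a mutate step. The name `long`, the regular expressions, and the final `select` (which also renames `Sick` to `Sickness` to match the requested output) are illustrative choices rather than part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question, stored under the name used in this answer
mydf <- data.frame(
  Subject = 1:2,
  Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
  Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0)
)

long <- mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) %>%
  mutate(
    # Keep the letters as the animal name: "Cat1" -> "Cat"
    Animal = gsub("[0-9]+$", "", CatNumber),
    # Keep only the digits as the number: "Cat1" -> 1
    CatNumber = as.integer(gsub("[^0-9]", "", CatNumber))
  ) %>%
  select(Subject, Animal, CatNumber, Weight, Sickness = Sick)

long
```

Computing `Animal` before overwriting `CatNumber` matters here, because mutate evaluates its arguments in order.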

Update

According to the discussions in chat, your data actually look more like:

A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")

colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])

subject = 1:60

set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE), 
            nrow = length(subject), dimnames = list(NULL, colNames))

mydf <- data.frame(Subject = subject, M)

So you need a few extra steps to get the desired output. Try:

library(dplyr)
library(tidyr)
mydf %>% 
  group_by(Subject) %>%                    ## Your ID variable
  gather(var, val, -Subject) %>%           ## Make long data. Everything except your IDs
  separate(var, into = c("partA", "partB")) %>%  ## Split new column into two parts
  mutate(partA = gsub("([^0-9]*)([0-9]+)", "\\1_\\2", partA)) %>% ## Put "_" before the digits (multi-digit safe)
  separate(partA, into = c("A1", "A2")) %>%                  ## Split this new column
  spread(partB, val)                                         ## Transform to wide form

Which yields:

Source: local data frame [900 x 5]

   Subject     A1    A2 ResumptionMisses ResumptionTime
     (int)  (chr) (chr)            (int)          (int)
1        1 ATCint     1                9              3
2        1 ATCint     2                4              3
3        1 ATCint     3                2              2
4        1 ATCint     4                7              4
5        1 ATCint     5                7              1
6        1  Blank     1                4             10
7        1  Blank     2                2              4
8        1  Blank     3                7              5
9        1  Blank     4                1              9
10       1  Blank     5               10             10
..     ...    ...   ...              ...            ...
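For readers on a newer tidyr, the same reshaping can probably be done in one step. This is a minimal sketch that assumes tidyr 1.0.0 or later, where `pivot_longer` accepts `names_pattern` and the special `.value` name. It is an alternative technique, not the method used in this answer, and the small data frame is a made-up stand-in for the simulated data.

```r
library(tidyr)

# Hypothetical small stand-in for the simulated data (2 subjects, 2 conditions, 2 trials)
mydf <- data.frame(
  Subject = 1:2,
  ATCint1_ResumptionTime = c(3, 5),
  ATCint2_ResumptionTime = c(4, 6),
  Blank1_ResumptionTime = c(1, 2),
  Blank2_ResumptionTime = c(7, 8),
  ATCint1_ResumptionMisses = c(0, 1),
  ATCint2_ResumptionMisses = c(2, 3),
  Blank1_ResumptionMisses = c(4, 5),
  Blank2_ResumptionMisses = c(6, 7)
)

# The three capture groups map to: condition name, trial number, measured variable.
# ".value" tells pivot_longer to turn the third group into separate value columns.
long <- pivot_longer(
  mydf,
  cols = -Subject,
  names_to = c("A1", "A2", ".value"),
  names_pattern = "^([A-Za-z]+)([0-9]+)_(.+)$"
)

long
```

Note that `A2` comes back as a character column, so it may need `as.integer()` before any numeric use.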

Answer 1 (score: 3)

We can use melt from library(data.table), which can take multiple patterns for the measure variables.

library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure=patterns('Weight$', 'Sick$'), 
            variable.name='CatNumber', value.name=c('Weight', 'Sick'))[order(Subject)]
DT 
#   Subject CatNumber Weight Sick
#1:       1         1     10    1
#2:       1         2     11    0
#3:       1         3     12    0
#4:       2         1      7    1
#5:       2         2      8    0
#6:       2         3      9    0

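If we need the "Animal" column, we can grep for the "Cat" columns, use sub to remove the suffix substring, and assign (:=) the result to create the "Animal" column.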

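# unique() keeps a single "Cat" value, so this also works when the row count differs from the number of matched columns
DT[, Animal := unique(sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE)))]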

DT
#   Subject CatNumber Weight Sick Animal
#1:       1         1     10    1    Cat
#2:       1         2     11    0    Cat
#3:       1         3     12    0    Cat
#4:       2         1      7    1    Cat
#5:       2         2      8    0    Cat
#6:       2         3      9    0    Cat
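To make the steps above reproducible end to end, here is a minimal sketch that builds `df1` from the question's example and then tidies the result. The `as.data.table` copy, the integer conversion of `CatNumber`, and the hard-coded `"Cat"` value are assumptions added for illustration. With `patterns`, the `CatNumber` column holds the position of each match within its pattern rather than a number parsed from the column name, so it only lines up with Cat1, Cat2, Cat3 because the columns appear in that order.

```r
library(data.table)

# Example data from the question
df1 <- data.frame(
  Subject = 1:2,
  Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
  Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0)
)

# as.data.table() makes a copy, leaving df1 itself untouched
DT <- melt(as.data.table(df1),
           measure.vars = patterns("Weight$", "Sick$"),
           variable.name = "CatNumber",
           value.name = c("Weight", "Sick"))[order(Subject)]

# melt() returns CatNumber as a factor; convert it to a plain integer
DT[, CatNumber := as.integer(as.character(CatNumber))]

# Every measure column in this example shares the "Cat" prefix
DT[, Animal := "Cat"]

DT
```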

Answer 2 (score: 3)

You can do this with base R's reshape, for example:

reshape(dat, idvar="Subject", direction="long", varying=list(2:4,5:7),
        v.names=c("Weight","Sick"), timevar="CatNumber")

#    Subject CatNumber Weight Sick
#1.1       1         1     10    1
#2.1       2         1      7    1
#1.2       1         2     11    0
#2.2       2         2      8    0
#1.3       1         3     12    0
#2.3       2         3      9    0

Or, since reshape expects names like variablename_groupname, you can change the names and then let reshape do the hard work:

names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar="Subject", direction="long", varying=-1, 
        sep="_", timevar="CatNumber")

#    Subject CatNumber Weight Sick
#1.1       1         1     10    1
#2.1       2         1      7    1
#1.2       1         2     11    0
#2.2       2         2      8    0
#1.3       1         3     12    0
#2.3       2         3      9    0
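The renaming approach can be extended to produce the "Animal" column from the question's edit. The following is a sketch in base R only. The variable names `animal` and `long`, the slightly stricter regular expression, and the final ordering and row-name cleanup are illustrative additions, and it assumes every measure column shares one animal prefix.

```r
# Example data from the question
dat <- data.frame(
  Subject = 1:2,
  Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
  Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0)
)

# Capture the animal prefix before the names are rewritten: "Cat1_Weight" -> "Cat"
animal <- unique(sub("[0-9]+_.*$", "", names(dat)[-1]))

# Rewrite "Cat1_Weight" as "Weight_1" so reshape can split on "_"
names(dat) <- gsub("^[A-Za-z]+([0-9]+)_(.+)$", "\\2_\\1", names(dat))

long <- reshape(dat, idvar = "Subject", direction = "long", varying = -1,
                sep = "_", timevar = "CatNumber")

# Add the animal name, sort by subject, and drop the "1.1"-style row names
long$Animal <- animal
long <- long[order(long$Subject, long$CatNumber),
             c("Subject", "Animal", "CatNumber", "Weight", "Sick")]
rownames(long) <- NULL

long
```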