Going from wide to long when melt() can't handle it

Time: 2015-11-06 06:03:49

Tags: r

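There are already many questions about stats::reshape, and every one I have found is easier to follow than the function's own rather impenetrable documentation. One user talks of migraines; another tried to make a more sensible wrapper. I have nightmares about stats::reshape too, but sometimes there seems to be no other clever way.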

Here is my current situation. I am trying to simulate some data:

#parameters
N = 5
A_trait_mean = 100
B_trait_mean = 90
C_trait_mean = 95
C_admix_A_mean = 50
C_admix_A_SD = 20
trait_SD = 10

#generate A & B
set.seed(123)
{
df = data.frame(trait_A = rnorm(N, A_trait_mean, trait_SD),
                trait_B = rnorm(N, B_trait_mean, trait_SD),
                admixA_C = rnorm(N, C_admix_A_mean, C_admix_A_SD))
}


#clip C values -- more than 100% admixture is impossible
df$admixA_C[df$admixA_C > 100] = 100
df$admixA_C[df$admixA_C < 0] = 0

#mate
for (row_i in 1:nrow(df)) {
  tmp_admix_A = df$admixA_C[row_i]/100
  df$trait_C[row_i] = df$trait_A[row_i] * tmp_admix_A + df$trait_B[row_i] * (1 - tmp_admix_A)
}

This gives:

> df
    trait_A   trait_B admixA_C  trait_C
1  94.39524 107.15065 74.48164 97.65021
2  97.69823  94.60916 57.19628 96.37599
3 115.58708  77.34939 58.01543 99.53315
4 100.70508  83.13147 52.21365 92.30730
5 101.29288  85.54338 38.88318 91.66729

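So far so good. To plot data like this, however, ggplot2 needs long format. The catch is that the current 4 columns must become 3 new columns, not just 2. We need a group column, a trait column, and an admixA column.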

Here is what I tried first:

> df_long = melt(df)
Using  as id variables
> df_long
   variable     value
1   trait_A  94.39524
2   trait_A  97.69823
3   trait_A 115.58708
4   trait_A 100.70508
5   trait_A 101.29288
6   trait_B 107.15065
7   trait_B  94.60916
8   trait_B  77.34939
9   trait_B  83.13147
10  trait_B  85.54338
11 admixA_C  74.48164
12 admixA_C  57.19628
13 admixA_C  58.01543
14 admixA_C  52.21365
15 admixA_C  38.88318
16  trait_C  97.65021
17  trait_C  96.37599
18  trait_C  99.53315
19  trait_C  92.30730
20  trait_C  91.66729

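This is plainly wrong. The admixA values are mixed in with the trait values, and there is no grouping variable. So I turned to stats::reshape, because I had a vague memory that it can handle this kind of problem.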

So I read the documentation and gave it a try:

df_long = reshape(df, varying = 1:4, sep = "_", direction = "long")
# 'varying' arguments must be the same length

No luck, and a rather cryptic error. The same length as... what?
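One plausible reading of that error, sketched here rather than taken from the thread: with sep = "_", reshape splits each column name into a variable stem and a time label, and it expects every stem to have the same number of columns. The snippet below only mimics that split with strsplit (the helper names parts, stems and labels are made up for illustration) and counts the columns per stem.

```r
# Column names from the question
nms <- c("trait_A", "trait_B", "admixA_C", "trait_C")

# Split each name at "_" into a variable stem and a time/group label
parts  <- do.call(rbind, strsplit(nms, "_", fixed = TRUE))
stems  <- parts[, 1]
labels <- parts[, 2]

# Count how many columns share each stem
stem_counts <- table(stems)
stem_counts
```

Here "trait" has three columns and "admixA" has only one, so the two sets of varying columns do not have the same length, which would fit the message.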

Then I googled and saw someone say that a list should be used:

> df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long")
> df_long
    time   trait_A id
1.1    1  94.39524  1
2.1    1  97.69823  2
3.1    1 115.58708  3
4.1    1 100.70508  4
5.1    1 101.29288  5
1.2    2 107.15065  1
2.2    2  94.60916  2
3.2    2  77.34939  3
4.2    2  83.13147  4
5.2    2  85.54338  5
1.3    3  74.48164  1
2.3    3  57.19628  2
3.3    3  58.01543  3
4.3    3  52.21365  4
5.3    3  38.88318  5
1.4    4  97.65021  1
2.4    4  96.37599  2
3.4    4  99.53315  3
4.4    4  92.30730  4
5.4    4  91.66729  5

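No error, but the group variable is wrong, the value column has the wrong name (it should just be "trait"), and the admixA values are included among the traits.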

Maybe...

df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long", v.names = c("trait", "admixA"))
Error in varying[[i]] : subscript out of bounds

Guess not, but perhaps without the list:

> df_long = reshape(df, varying = 1:4, sep = "_", direction = "long", v.names = c("trait", "admixA"))
> df_long
    time     trait    admixA id
1.1    1 107.15065  94.39524  1
2.1    1  94.60916  97.69823  2
3.1    1  77.34939 115.58708  3
4.1    1  83.13147 100.70508  4
5.1    1  85.54338 101.29288  5
1.2    2  97.65021  74.48164  1
2.2    2  96.37599  57.19628  2
3.2    2  99.53315  58.01543  3
4.2    2  92.30730  52.21365  4
5.2    2  91.66729  38.88318  5

We are getting closer. However, the data are wrong and mixed up, and there is still no proper grouping variable.

So after trying a few other more or less blind things, I gave up and used a manual, multi-step approach:

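#give up and use multiple rounds of melt() manually
#melt() with variable_name comes from the reshape package; str_match() from stringr
library(reshape)
library(stringr)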
df_long = melt(df[c(1, 2, 4)], variable_name = "group")
#fix group variable
df_long$group = str_match(df_long$group, "_(\\w)")[, 2]
#move "value" to "trait"
df_long$trait = df_long$value; df_long$value = NULL
#add admixA
df_long$admixA = c(rep(NA, 2 * nrow(df)), df$admixA_C)

This gives:

> df_long
   group     trait   admixA
1      A  94.39524       NA
2      A  97.69823       NA
3      A 115.58708       NA
4      A 100.70508       NA
5      A 101.29288       NA
6      B 107.15065       NA
7      B  94.60916       NA
8      B  77.34939       NA
9      B  83.13147       NA
10     B  85.54338       NA
11     C  97.65021 74.48164
12     C  96.37599 57.19628
13     C  99.53315 58.01543
14     C  92.30730 52.21365
15     C  91.66729 38.88318

Hooray! But how does one actually use stats::reshape to get that result?

What would be a better, non-migraine-inducing solution for this kind of transformation?

2 Answers:

Answer 0: (score: 2)

reshape can sort this out for you:

reshape(df, direction="long", varying=c(1,2,4), sep="_", timevar="group")
#    admixA_C group     trait id
#1.A 74.48164     A  94.39524  1
#2.A 57.19628     A  97.69823  2
#3.A 58.01543     A 115.58708  3
#4.A 52.21365     A 100.70508  4
#5.A 38.88318     A 101.29288  5
#1.B 74.48164     B 107.15065  1
#2.B 57.19628     B  94.60916  2
#...

You don't have an idvar, so don't specify one.

varying lists only the 3 columns that need to be made "long".

Specifying sep="_" means reshape will guess the categories of the time variable, and that your 3 related "trait_A|B|C" variables all represent "trait".

timevar="group" just gives your grouping variable an appropriate label.

Since admixA_C is not one of the varying columns, it is simply repeated alongside each group in its own column.

You can discard the id column, since you no longer need it.
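To get from that result to the exact table in the question, a minimal sketch could look like the following. It rebuilds the question's data with the same seed, uses pmin/pmax as a stand-in for the two clipping lines, and blanks admixA for groups A and B. The ifelse step and the column selection are additions for illustration, not part of this answer.

```r
# Rebuild the question's data (same seed, same order of rnorm calls)
set.seed(123)
N <- 5
df <- data.frame(trait_A  = rnorm(N, 100, 10),
                 trait_B  = rnorm(N, 90, 10),
                 admixA_C = rnorm(N, 50, 20))
# Clip admixture to [0, 100], equivalent to the two assignments in the question
df$admixA_C <- pmin(pmax(df$admixA_C, 0), 100)
df$trait_C <- with(df, trait_A * admixA_C / 100 + trait_B * (1 - admixA_C / 100))

# Reshape only the three trait columns; admixA_C is carried along unchanged
df_long <- reshape(df, direction = "long", varying = c(1, 2, 4),
                   sep = "_", timevar = "group")

# admixA only applies to group C, so set it to NA for the other groups
df_long$admixA <- ifelse(df_long$group == "C", df_long$admixA_C, NA)

# Keep the three columns asked for and drop reshape's "1.A"-style row names
df_long <- df_long[, c("group", "trait", "admixA")]
rownames(df_long) <- NULL
df_long
```

The result has one row per individual and group, with admixA filled in only for group C.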

Answer 1: (score: 1)

The same can be done with melt(...), although because of the way you named your columns, the reshape(...) solution is probably shorter.

library(reshape2)
df.melt <- melt(df, measure.vars=c(1,2,4), variable.name="group", value.name="trait")
df.melt <- transform(df.melt, group=gsub("^.*_","",group))
df.melt <- transform(df.melt, admixA_C=ifelse(group=="C",admixA_C,NA))
df.melt
#    admixA_C group     trait
# 1        NA     A  94.39524
# 2        NA     A  97.69823
# 3        NA     A 115.58708
# 4        NA     A 100.70508
# 5        NA     A 101.29288
# 6        NA     B 107.15065
# 7        NA     B  94.60916
# 8        NA     B  77.34939
# 9        NA     B  83.13147
# 10       NA     B  85.54338
# 11 74.48164     C  97.65021
# 12 57.19628     C  96.37599
# 13 58.01543     C  99.53315
# 14 52.21365     C  92.30730
# 15 38.88318     C  91.66729

Also, you can replace the for loop with:

df$trait_C <- with(df, trait_A*admixA_C/100 + trait_B*(1-admixA_C/100))
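As a quick sanity check, the sketch below recomputes trait_C both ways on the question's simulated data and compares the results. The names loop_C and vec_C are made up for this illustration, and the clipping step is left out because it does not affect the comparison.

```r
set.seed(123)
df <- data.frame(trait_A  = rnorm(5, 100, 10),
                 trait_B  = rnorm(5, 90, 10),
                 admixA_C = rnorm(5, 50, 20))

# Loop version, following the question
loop_C <- numeric(nrow(df))
for (row_i in seq_len(nrow(df))) {
  p <- df$admixA_C[row_i] / 100
  loop_C[row_i] <- df$trait_A[row_i] * p + df$trait_B[row_i] * (1 - p)
}

# Vectorized version, following this answer
vec_C <- with(df, trait_A * admixA_C / 100 + trait_B * (1 - admixA_C / 100))

# Compare with a numeric tolerance rather than exact equality
all.equal(loop_C, vec_C)
```

all.equal is used instead of identical because the two expressions group the arithmetic slightly differently, so tiny floating-point differences are possible.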