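There are already many questions about stats::reshape, and every one I found is easier to understand than the function's own rather impenetrable documentation. One user talks of migraines, another tried to make a more sensible wrapper. I have nightmares about stats::reshape too, but sometimes there seems to be no other clever way.
Here is my current situation. I am trying to simulate some data: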
#parameters
N = 5
A_trait_mean = 100
B_trait_mean = 90
C_trait_mean = 95
C_admix_A_mean = 50
C_admix_A_SD = 20
trait_SD = 10
#generate A & B
set.seed(123)
{
df = data.frame(trait_A = rnorm(N, A_trait_mean, trait_SD),
trait_B = rnorm(N, B_trait_mean, trait_SD),
admixA_C = rnorm(N, C_admix_A_mean, C_admix_A_SD))
}
#clip C values -- more than 100% admixture is impossible
df$admixA_C[df$admixA_C > 100] = 100
df$admixA_C[df$admixA_C < 0] = 0
#mate
for (row_i in 1:nrow(df)) {
tmp_admix_A = df$admixA_C[row_i]/100
df$trait_C[row_i] = df$trait_A[row_i] * tmp_admix_A + df$trait_B[row_i] * (1 - tmp_admix_A)
}
This gives:
> df
trait_A trait_B admixA_C trait_C
1 94.39524 107.15065 74.48164 97.65021
2 97.69823 94.60916 57.19628 96.37599
3 115.58708 77.34939 58.01543 99.53315
4 100.70508 83.13147 52.21365 92.30730
5 101.29288 85.54338 38.88318 91.66729
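So far so good. However, to plot data like this, ggplot2 needs it in long format. Here the current 4 columns have to become 3 new columns, not just 2: a group column, a trait column, and an admixA column.
Here is what I came up with first: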
> df_long = melt(df)
Using as id variables
> df_long
variable value
1 trait_A 94.39524
2 trait_A 97.69823
3 trait_A 115.58708
4 trait_A 100.70508
5 trait_A 101.29288
6 trait_B 107.15065
7 trait_B 94.60916
8 trait_B 77.34939
9 trait_B 83.13147
10 trait_B 85.54338
11 admixA_C 74.48164
12 admixA_C 57.19628
13 admixA_C 58.01543
14 admixA_C 52.21365
15 admixA_C 38.88318
16 trait_C 97.65021
17 trait_C 96.37599
18 trait_C 99.53315
19 trait_C 92.30730
20 trait_C 91.66729
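This is badly wrong. The admixA values are mixed in with the trait values, and there is no grouping variable. So I turned to stats::reshape, because I vaguely remembered that it can handle this kind of problem.
So I read the documentation and tried this: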
df_long = reshape(df, varying = 1:4, sep = "_", direction = "long")
# 'varying' arguments must be the same length
No luck, and a rather cryptic error. The same length... as what?
Then I googled and saw someone say that one should use list.
> df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long")
> df_long
time trait_A id
1.1 1 94.39524 1
2.1 1 97.69823 2
3.1 1 115.58708 3
4.1 1 100.70508 4
5.1 1 101.29288 5
1.2 2 107.15065 1
2.2 2 94.60916 2
3.2 2 77.34939 3
4.2 2 83.13147 4
5.2 2 85.54338 5
1.3 3 74.48164 1
2.3 3 57.19628 2
3.3 3 58.01543 3
4.3 3 52.21365 4
5.3 3 38.88318 5
1.4 4 97.65021 1
2.4 4 96.37599 2
3.4 4 99.53315 3
4.4 4 92.30730 4
5.4 4 91.66729 5
No error, but the group variable is wrong, the value column has the wrong name (it should be just "trait"), and the admixA values are included in it.
Maybe...
df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long", v.names = c("trait", "admixA"))
Error in varying[[i]] : subscript out of bounds
Guess not, but maybe without list?
> df_long = reshape(df, varying = 1:4, sep = "_", direction = "long", v.names = c("trait", "admixA"))
> df_long
time trait admixA id
1.1 1 107.15065 94.39524 1
2.1 1 94.60916 97.69823 2
3.1 1 77.34939 115.58708 3
4.1 1 83.13147 100.70508 4
5.1 1 85.54338 101.29288 5
1.2 2 97.65021 74.48164 1
2.2 2 96.37599 57.19628 2
3.2 2 99.53315 58.01543 3
4.2 2 92.30730 52.21365 4
5.2 2 91.66729 38.88318 5
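We are getting closer. But the data are wrong and mixed up, and there is still no proper grouping variable.
So after trying a few other more or less blind things, I gave up and used a manual, multi-step approach: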
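#give up and use multiple rounds of melt() manually
#(melt(variable_name = ) is the reshape package's interface; str_match() is from stringr)
library(reshape)
library(stringr)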
df_long = melt(df[c(1, 2, 4)], variable_name = "group")
#fix group variable
df_long$group = str_match(df_long$group, "_(\\w)")[, 2]
#move "value" to "trait"
df_long$trait = df_long$value; df_long$value = NULL
#add admixA
df_long$admixA = c(rep(NA, 2 * nrow(df)), df$admixA_C)
This gives:
> df_long
group trait admixA
1 A 94.39524 NA
2 A 97.69823 NA
3 A 115.58708 NA
4 A 100.70508 NA
5 A 101.29288 NA
6 B 107.15065 NA
7 B 94.60916 NA
8 B 77.34939 NA
9 B 83.13147 NA
10 B 85.54338 NA
11 C 97.65021 74.48164
12 C 96.37599 57.19628
13 C 99.53315 58.01543
14 C 92.30730 52.21365
15 C 91.66729 38.88318
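Hooray! But how does one actually use stats::reshape to get this result?
What better, non-migraine-inducing solution is there for this kind of transformation?
Answer 0 (score: 2)
reshape can sort this out for you: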
reshape(df, direction="long", varying=c(1,2,4), sep="_", timevar="group")
# admixA_C group trait id
#1.A 74.48164 A 94.39524 1
#2.A 57.19628 A 97.69823 2
#3.A 58.01543 A 115.58708 3
#4.A 52.21365 A 100.70508 4
#5.A 38.88318 A 101.29288 5
#1.B 74.48164 B 107.15065 1
#2.B 57.19628 B 94.60916 2
#...
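You don't have an idvar, so don't specify one.
varying lists only the 3 columns that need to be made "long".
Specifying sep="_" means that reshape guesses the categories of the time variable from the column names, and works out that your 3 relevant "trait_A|B|C" variables all represent "trait".
timevar="group" just gives your grouping variable an appropriate label.
Since admixA_C is not varying, it is simply repeated alongside the remaining columns.
You can drop the id column, since you don't need it here.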
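The call above repeats admixA_C for every group, whereas the target in the question has NA for groups A and B. The following is a minimal sketch of one possible way to get that shape directly from reshape, not the answer's own method. It assumes that padding the data with two placeholder columns, admixA_A and admixA_B (hypothetical names introduced here), is acceptable. It also shows why varying = 1:4 failed: splitting the names on "_" yields a "trait" stem with 3 columns but an "admixA" stem with only 1. As far as I understand, reshape pairs columns with groups by position within each stem, so the columns are listed in the same A, B, C order for both stems.

```r
# Rebuild the question's simulated data so this sketch runs on its own.
set.seed(123)
N <- 5
df <- data.frame(trait_A  = rnorm(N, 100, 10),
                 trait_B  = rnorm(N, 90, 10),
                 admixA_C = rnorm(N, 50, 20))
df$admixA_C <- pmin(pmax(df$admixA_C, 0), 100)  # clip to the 0..100 range
df$trait_C <- with(df, trait_A * admixA_C / 100 + trait_B * (1 - admixA_C / 100))

# Splitting the names on "_" gives a stem and a group for each column.
parts <- do.call(rbind, strsplit(names(df), "_", fixed = TRUE))
stem_counts <- table(parts[, 1])  # admixA: 1 column, trait: 3 columns

# Assumption: pad the short stem with NA placeholders (hypothetical columns).
df_pad <- df
df_pad$admixA_A <- NA_real_
df_pad$admixA_B <- NA_real_

# List the columns in the same group order (A, B, C) within each stem.
vary <- c("trait_A", "trait_B", "trait_C",
          "admixA_A", "admixA_B", "admixA_C")
df_long <- reshape(df_pad, direction = "long", varying = vary,
                   sep = "_", timevar = "group")

# Keep only the three wanted columns and drop the "1.A"-style row names.
df_long <- df_long[c("group", "trait", "admixA")]
rownames(df_long) <- NULL
df_long
```

If the placeholder columns feel too hacky, an alternative is to keep the answer's call as it is and blank out admixA_C afterwards for the rows where group is not "C".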
Answer 1 (score: 1)
You can do the same with melt(...), but because of the way you named your columns, the reshape(...) solution is probably shorter.
library(reshape2)
df.melt <- melt(df, measure.vars=c(1,2,4), variable.name="group", value.name="trait")
df.melt <- transform(df.melt, group=gsub("^.*_","",group))
df.melt <- transform(df.melt, admixA_C=ifelse(group=="C",admixA_C,NA))
df.melt
# admixA_C group trait
# 1 NA A 94.39524
# 2 NA A 97.69823
# 3 NA A 115.58708
# 4 NA A 100.70508
# 5 NA A 101.29288
# 6 NA B 107.15065
# 7 NA B 94.60916
# 8 NA B 77.34939
# 9 NA B 83.13147
# 10 NA B 85.54338
# 11 74.48164 C 97.65021
# 12 57.19628 C 96.37599
# 13 58.01543 C 99.53315
# 14 52.21365 C 92.30730
# 15 38.88318 C 91.66729
Also, you can replace the for loop with:
df$trait_C <- with(df, trait_A*admixA_C/100 + trait_B*(1-admixA_C/100))
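As a quick sanity check, the sketch below rebuilds the question's data and compares a loop version with the vectorized one-liner. The name trait_C_loop is introduced here only for the comparison, and the loop is a lightly tidied equivalent of the question's loop rather than a verbatim copy.

```r
# Rebuild the question's simulated data so this sketch runs on its own.
set.seed(123)
N <- 5
df <- data.frame(trait_A  = rnorm(N, 100, 10),
                 trait_B  = rnorm(N, 90, 10),
                 admixA_C = rnorm(N, 50, 20))
df$admixA_C <- pmin(pmax(df$admixA_C, 0), 100)  # clip to the 0..100 range

# Loop version: mix the A and B trait values row by row.
trait_C_loop <- numeric(nrow(df))
for (row_i in seq_len(nrow(df))) {
  tmp_admix_A <- df$admixA_C[row_i] / 100
  trait_C_loop[row_i] <- df$trait_A[row_i] * tmp_admix_A +
    df$trait_B[row_i] * (1 - tmp_admix_A)
}

# Vectorized version from the answer.
df$trait_C <- with(df, trait_A * admixA_C / 100 + trait_B * (1 - admixA_C / 100))

# The two should agree up to floating-point tolerance.
same_result <- isTRUE(all.equal(trait_C_loop, df$trait_C))
same_result
```

The vectorized form works because arithmetic on data frame columns is already element-wise, so no explicit row index is needed.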