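I want to partition the following dataset (dataGenotype) based on two variables, Genotype and stand_ID. For example, for genotype H13, stand_ID 7 could go to training while stand_IDs 18 and 21 go to testing.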
Genotype stand_ID Inventory_date stemC mheight
H13 7 5/18/2006 1940.1075 11.33995
H13 7 11/1/2008 10898.9597 23.20395
H13 7 4/14/2009 12830.1284 23.77395
H13 18 11/3/2005 2726.42 13.4432
H13 18 6/30/2008 12226.1554 24.091967
H13 18 4/14/2009 14141.68 25.0922
H13 21 5/18/2006 4981.7158 15.7173
H13 21 4/14/2009 20327.0667 27.9155
H15 9 3/31/2006 3570.06 14.7898
H15 9 11/1/2008 15138.8383 26.2088
H15 9 4/14/2009 17035.4688 26.8778
H15 20 1/18/2005 3016.881 14.1886
H15 20 10/4/2006 8330.4688 20.19425
H15 20 6/30/2008 13576.5 25.4774
U21 3 1/9/2006 3660.416 15.09925
U21 3 6/30/2008 13236.29 24.27634
U21 3 4/14/2009 16124.192 25.79562
U21 67 11/4/2005 2812.8425 13.60485
U21 67 4/14/2009 13468.455 24.6203
The desired output is as follows:
A. Training
Genotype stand_ID Inventory_date stemC mheight
H13 7 5/18/2006 1940.1075 11.33995
H13 7 11/1/2008 10898.9597 23.20395
H13 7 4/14/2009 12830.1284 23.77395
H15 9 3/31/2006 3570.06 14.7898
H15 9 11/1/2008 15138.8383 26.2088
H15 9 4/14/2009 17035.4688 26.8778
U21 67 11/4/2005 2812.8425 13.60485
U21 67 4/14/2009 13468.455 24.6203
B. Testing
Genotype stand_ID Inventory_date stemC mheight
H13 18 11/3/2005 2726.42 13.4432
H13 18 6/30/2008 12226.1554 24.091967
H13 18 4/14/2009 14141.68 25.0922
H13 21 5/18/2006 4981.7158 15.7173
H13 21 4/14/2009 20327.0667 27.9155
H15 20 1/18/2005 3016.881 14.1886
H15 20 10/4/2006 8330.4688 20.19425
H15 20 6/30/2008 13576.5 25.4774
U21 3 1/9/2006 3660.416 15.09925
U21 3 6/30/2008 13236.29 24.27634
U21 3 4/14/2009 16124.192 25.79562
I tried the following code:
library(caret)
clonePartitioning <- createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
train = dataGenotype[clonePartitioning,]
test = dataGenotype[-clonePartitioning,]
I also tried:
createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
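Neither attempt produced the desired output: the data was split within each stand_ID. For example, one row of stand_ID 7 went to training while the other two rows of stand_ID 7 went to testing. How can I partition the data by stand_ID within each Genotype, so that all rows of a stand end up on the same side?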
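A quick way to check whether a split suffers from the problem described above is to look for stand_IDs that occur on both sides. The snippet below is only a sketch on a hypothetical six-row miniature of dataGenotype (the `leaked` name and the hand-made row-level split are assumptions for illustration, not part of the original code):

```r
# Hypothetical miniature of dataGenotype: only the two grouping columns matter here.
dataGenotype <- data.frame(
  Genotype = c("H13", "H13", "H13", "H13", "H13", "H13"),
  stand_ID = c(7, 7, 7, 18, 18, 18)
)

# A hand-made row-level split like the one described:
# one row of stand 7 goes to train, everything else to test.
train <- dataGenotype[1, ]
test  <- dataGenotype[-1, ]

# Stands that occur on both sides of the split; for a stand-level split this is empty.
leaked <- intersect(train$stand_ID, test$stand_ID)
print(leaked)
```

Here stand 7 is reported, because its rows are divided between train and test.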
Answer 0 (score: 0)
Here is an approach using dplyr:
library(tidyverse)
set.seed(1) # for reproducibility of the split
df %>%
  group_by(Genotype) %>% # group data by Genotype
  distinct(stand_ID) %>% # keep the unique stand_IDs within each Genotype
  sample_frac(.2) %>% # sample these stand_IDs with a fraction of your choice
  mutate(data = "test") %>% # label the sampled stands as test
  left_join(df, ., by = c("Genotype", "stand_ID")) %>% # join onto the original data frame so its row order is kept (right_join(df) does not guarantee this in dplyr >= 1.0); train rows will be NA
  pull(data) %>% # pull the vector of test/NA labels
  is.na -> train_ind # TRUE for the NA rows, i.e. the train rows
df[train_ind,]
Genotype stand_ID Inventory_date stemC mheight
4 H13 18 11/3/2005 2726.420 13.44320
5 H13 18 6/30/2008 12226.155 24.09197
6 H13 18 4/14/2009 14141.680 25.09220
7 H13 21 5/18/2006 4981.716 15.71730
8 H13 21 4/14/2009 20327.067 27.91550
9 H15 9 3/31/2006 3570.060 14.78980
10 H15 9 11/1/2008 15138.838 26.20880
11 H15 9 4/14/2009 17035.469 26.87780
15 H15 32 2/1/2006 3426.253 14.31815
16 U21 3 1/9/2006 3660.416 15.09925
17 U21 3 6/30/2008 13236.290 24.27634
18 U21 3 4/14/2009 16124.192 25.79562
19 U21 67 11/4/2005 2812.843 13.60485
20 U21 67 4/14/2009 13468.455 24.62030
df[!train_ind,]
Genotype stand_ID Inventory_date stemC mheight
1 H13 7 5/18/2006 1940.108 11.33995
2 H13 7 11/1/2008 10898.960 23.20395
3 H13 7 4/14/2009 12830.128 23.77395
12 H15 20 1/18/2005 3016.881 14.18860
13 H15 20 10/4/2006 8330.469 20.19425
14 H15 20 6/30/2008 13576.500 25.47740
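For comparison, the same idea (sample whole stands within each Genotype, then assign every row of a stand to one side) can be sketched in base R without extra packages. This is only an illustration under stated assumptions: it rebuilds the two grouping columns from the question's data, draws exactly one stand per Genotype for training as in the desired output, and the names `stands`, `train_stands`, `key` and `in_train` are made up for this sketch.

```r
# Grouping columns copied from the question's data; the other columns are not needed for the split.
dataGenotype <- data.frame(
  Genotype = c(rep("H13", 8), rep("H15", 6), rep("U21", 5)),
  stand_ID = c(7, 7, 7, 18, 18, 18, 21, 21,
               9, 9, 9, 20, 20, 20,
               3, 3, 3, 67, 67),
  stringsAsFactors = FALSE
)

set.seed(1)  # for a reproducible split

# One row per (Genotype, stand_ID) pair.
stands <- unique(dataGenotype[, c("Genotype", "stand_ID")])

# For each Genotype, pick one stand at random for training.
# Sampling a row position avoids sample()'s special case for a single number.
train_stands <- do.call(rbind, lapply(split(stands, stands$Genotype), function(g) {
  g[sample(nrow(g), 1), , drop = FALSE]
}))

# Build a key per row and assign whole stands to one side or the other.
key       <- paste(dataGenotype$Genotype, dataGenotype$stand_ID)
train_key <- paste(train_stands$Genotype, train_stands$stand_ID)
in_train  <- key %in% train_key

train <- dataGenotype[in_train, ]
test  <- dataGenotype[!in_train, ]

# No (Genotype, stand_ID) pair should appear on both sides.
shared <- intersect(key[in_train], key[!in_train])
stopifnot(length(shared) == 0)
```

Which stand is drawn for each Genotype depends on the seed, so the exact rows may differ from the desired output in the question, but every stand stays entirely in either train or test.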