Randomly select a given percentage of rows and create new columns

Time: 2016-09-29 06:51:52

Tags: r

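I have a species column containing 10 names. I need to randomly split the species into four columns so that each column receives a fixed percentage of the species.

Say the first column gets 20%, the second 30%, the third 40% and the last 10%. The four columns represent four different environments, namely: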

Restricted, Tidalflat, beach, estuary

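So the number of entries in each column is fixed in advance, but which species end up in which column is random.

My input data looks like this: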

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')

The result should look like this:

[image of the expected output; not available]

2 answers:

Answer 0 (score: 3)

Some simple setup:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
rspecies <- sample(species)

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')

probs <- c(.2, .3, .4, .1)

nrs <- round(length(species) * probs)
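
One caveat about `nrs`: the rounded counts happen to sum to 10 here, but for other lengths they may not add up to the number of species, and `rep(envirs, nrs)` would then have the wrong length. A minimal sketch of one way to guard against that, using a largest-remainder adjustment (the helper name `alloc_counts` is hypothetical and not part of the answer):

```r
# Hypothetical helper: turn proportions into integer counts that sum to n.
alloc_counts <- function(n, probs) {
  raw <- n * probs
  counts <- floor(raw)
  # Hand the leftover units to the groups with the largest fractional parts.
  leftover <- n - sum(counts)
  if (leftover > 0) {
    idx <- order(raw - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[idx] <- counts[idx] + 1
  }
  counts
}

probs <- c(.2, .3, .4, .1)
alloc_counts(10, probs)  # same counts as round() for the example data
alloc_counts(11, probs)  # still sums to 11
```

Using `nrs <- alloc_counts(length(species), probs)` in place of the `round()` line should leave the rest of the answer unchanged.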

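Now, a data.frame with a separate column per environment is not a good way to represent this, because your data is not rectangular, i.e. the columns do not all have the same number of observations.

You can represent the data in long format: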

df <- data.frame(species = rspecies, envir = rep(envirs, nrs), stringsAsFactors = FALSE)
     species      envir
1    Tellina Restricted
2     Natica Restricted
3       Arca  Tidalflat
4     Mactra  Tidalflat
5    Tellina  Tidalflat
6       Arca      Beach
7  Nassarius      Beach
8    Cardium      Beach
9    Cardium      Beach
10    Natica    Estuary

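Or as a list (this output comes from a separate run, so the random assignment differs from the table above):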

split(rspecies, df$envir)
$Beach
[1] "Mactra" "Natica" "Arca"   "Arca"  

$Estuary
[1] "Tellina"

$Restricted
[1] "Nassarius" "Cardium"  

$Tidalflat
[1] "Cardium" "Natica"  "Tellina"
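
If a wide table with one column per environment is still wanted, as in the question, the list can be padded with `NA` to a common length. A minimal sketch, repeating the setup so it runs on its own (the names `groups`, `max_len` and `wide` are illustrative):

```r
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)
nrs <- round(length(species) * probs)

rspecies <- sample(species)

# Split by environment, keeping the environments in their original order.
groups <- split(rspecies, factor(rep(envirs, nrs), levels = envirs))

# Indexing past the end of a vector yields NA, which pads the short groups.
max_len <- max(lengths(groups))
wide <- as.data.frame(lapply(groups, function(x) x[seq_len(max_len)]),
                      stringsAsFactors = FALSE)
wide
```

The `NA` cells are only padding; the long format remains easier to work with for most analyses.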

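Edit:

One way to accommodate a different number of species is to make the assignment to environments probabilistic. The larger the actual dataset, the better this will work.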

species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium', 'Cardium')
length(species2)
[1] 11

grps <- sample(envirs, size = length(species2), prob = probs, replace = TRUE)
df2 <- data.frame(species = species2, envir = grps, stringsAsFactors = FALSE) 
df2 <- df2[order(df2$envir), ]
     species      envir
5       Arca      Beach
10   Cardium      Beach
1     Natica    Estuary
11   Cardium    Estuary
3     Mactra Restricted
7    Tellina Restricted
2    Tellina  Tidalflat
4     Natica  Tidalflat
6       Arca  Tidalflat
8  Nassarius  Tidalflat
9    Cardium  Tidalflat
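
With this approach the proportions only hold on average, as the 11-row output above shows. A quick way to see the effect of sample size is to compare the realized shares for a small and a large draw. The sketch below uses made-up sizes (11 and 100000) and a hypothetical helper `realized_share`, purely for illustration:

```r
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)

# Share of draws that landed in each environment, in the order of envirs.
realized_share <- function(n) {
  grps <- sample(envirs, size = n, prob = probs, replace = TRUE)
  prop.table(table(factor(grps, levels = envirs)))
}

set.seed(1)             # arbitrary seed, only for reproducibility
realized_share(11)      # can deviate noticeably from probs
realized_share(100000)  # should be close to 0.2, 0.3, 0.4, 0.1
```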

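Answer 1 (score: 1)

Maybe not in one line of code. I don't understand the column part, but you can use something like the following to build a data frame, although your columns will have unequal lengths.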

species <- 1:1000

ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]

Or, to match your example:

species <- c('Natica'
             ,'Tellina'
             ,'Mactra'
             ,'Natica'
             ,'Arca'
             ,'Arca'
             ,'Tellina'
             ,'Nassarius'
             ,'Cardium'
             ,'Cardium')

ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]
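# Pad each group with NA up to the length of the longest one so the columns line up.
dflength <- max(length(first20), length(second30), length(third40), length(forth10))
# Column names must be distinct, otherwise data.frame() renames the duplicate (to f.1).
data.frame(f = c(first20, rep(NA, dflength - length(first20)))
           ,s = c(second30, rep(NA, dflength - length(second30)))
           ,t = c(third40, rep(NA, dflength - length(third40)))
           ,fo = c(forth10, rep(NA, dflength - length(forth10)))
           )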

I do feel some of these steps could be more compact, but I'll let you fiddle with it further.
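
One possible compaction, sketched here rather than taken from the answer: compute the group boundaries once with `cumsum` and let `cut` assign each shuffled position to a group. The names `breaks` and `grp` are illustrative, and the sketch assumes the vector is long enough that every group gets at least one element (otherwise the breaks are not unique and `cut` fails).

```r
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)

n <- length(species)
ranspecies <- sample(species)

# Upper boundary of each group: 2, 5, 9, 10 for n = 10.
breaks <- c(0, floor(n * cumsum(probs)))
breaks[length(breaks)] <- n  # make sure the last group reaches the end

# Label positions 1..n by the interval (breaks[i], breaks[i + 1]] they fall in.
grp <- cut(seq_len(n), breaks = breaks, labels = envirs)
split(ranspecies, grp)
```

This uses the same `floor`-based boundaries as the step-by-step version, so the group sizes should match it.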