I have the following data frame:
structure(list(Store = c("vpm", "vpm",
"vpm"), Date = structure(c(18042, 18042, 18042), class = "Date"),
UniqueImageId = c("vp3_523", "vp3_668", "vp3_523"), EntryTime = structure(c(1558835514,
1558834942, 1558835523), class = c("POSIXct", "POSIXt")),
ExitTime = structure(c(1558838793, 1558838793, 1558839824
), class = c("POSIXct", "POSIXt")), Duration = c(3279, 3851,
4301), Age = c(35L, 35L, 35L), EntryPoint = c("Entry2Side",
"Entry2Side", "Entry2Side"), ExitPoint = c("Exit2Side", "Exit2Side",
"Exit2Side"), AgeNew = c("15_20", "25_32", "15_20"), GenderNew = c("Female",
"Male", "Female")), row.names = 4:6, class = c("data.table",
"data.frame"))
I am trying to fill the AgeNew column with random numbers, using the sample function inside ifelse conditions.
I tried the following:
d$AgeNew <- ifelse(d$AgeNew == "0_2", sample(0:2, 1,replace = TRUE),
ifelse(d$AgeNew == "15_20", sample(15:20,1,replace = TRUE),
ifelse(d$AgeNew == "25_32", sample(25:36,1,replace = TRUE),
ifelse(d$AgeNew == "38_43", sample(36:43,1,replace = TRUE),
ifelse(d$AgeNew == "4_6", sample(4:6, 1,replace = TRUE),
ifelse(d$AgeNew == "48_53", sample(48:53,1,replace = TRUE),
ifelse(d$AgeNew == "60_Inf",sample(60:65,1,replace = TRUE),
sample(8:13, 1,replace = TRUE))))))))
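But I get the same value repeated over and over. For example, for the 0_2 age group I only ever get 2. I tried using set.seed
set.seed(123)
and then ran the ifelse again, but it still repeats the same value.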
Answer 0 (score: 3)
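This has been discussed somewhere (I can't find the source at the moment). It happens because ifelse evaluates each sample call only once per condition, so the single value it returns is recycled. Consider this example,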
x <- c(1, 2, 1, 2, 1, 2)
ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1] 1 26 1 26 1 26
ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1] 10 28 10 28 10 28
ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1] 9 24 9 24 9 24
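As we can see, it gives one number per condition, and that number is recycled across every matching position. To avoid this, we need to set the size argument of sample to the length of the test condition passed to ifelse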
ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1] 7 23 1 26 10 24
ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1] 3 23 5 26 6 22
ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1] 2 30 9 27 1 29
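One caveat with the fix above: sample draws without replacement by default, so a call like sample(0:2, length(x)) stops with an error once x has more elements than the range. A minimal sketch, assuming a hypothetical vector x that is longer than the smaller range, which passes replace = TRUE:

```r
set.seed(42)  # only to make the illustration reproducible

# Hypothetical input: 8 elements, but the range 0:2 holds only 3 values
x <- c(1, 2, 1, 2, 1, 2, 1, 2)

# replace = TRUE lets sample() return length(x) draws from a shorter range;
# without it, sample(0:2, 8) stops with an error
res <- ifelse(x == 1,
              sample(0:2, length(x), replace = TRUE),
              sample(20:30, length(x), replace = TRUE))
res
```

Each branch now supplies one draw per element of x, so nothing is recycled.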
Answer 1 (score: 1)
A simpler option is to replace the _ with :, then evaluate the resulting expression and sample one element from that range
library(data.table)
d[, AgeNew := sapply(sub("_", ":", sub('Inf', '65', AgeNew)),
function(x) sample(eval(parse(text = x)), 1))]
d[is.na(AgeNew), AgeNew := sample(8:13, 1)]
d
# Store Date UniqueImageId EntryTime ExitTime Duration Age EntryPoint ExitPoint AgeNew GenderNew
#1: vpm 2019-05-26 vp3_523 2019-05-25 21:51:54 2019-05-25 22:46:33 3279 35 Entry2Side Exit2Side 15 Female
#2: vpm 2019-05-26 vp3_668 2019-05-25 21:42:22 2019-05-25 22:46:33 3851 35 Entry2Side Exit2Side 30 Male
#3: vpm 2019-05-26 vp3_523 2019-05-25 21:52:03 2019-05-25 23:03:44 4301 35 Entry2Side Exit2Side 17 Female
Or with tidyverse
library(tidyverse)
d %>%
mutate(AgeNew = str_replace(AgeNew, "Inf", "65")) %>%
separate(AgeNew, into = c('start', 'end'), convert = TRUE) %>%
mutate(AgeNew = map2_int(start, end, ~ sample(.x:.y, 1)))
Or another option is to split on _ and then sample
d[, AgeNew := unlist(lapply(strsplit(sub('Inf', '65', AgeNew), "_"), function(x)
sample(as.numeric(x[1]):as.numeric(x[2]), 1)))]
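Note that we don't need any nested ifelse here. It can be done without any ifelse at all.
Note 2: The OP's example is a data.table, so the data.table approach is shown here.
Note 3: Using nested ifelse is inefficient.
Note 4: The strsplit-based approach was first posted here.
As for why ifelse behaves this way, the documentation at ?ifelse already mentions it:
If yes or no are too short, their elements are recycled. yes will be evaluated if and only if any element of test is true, and analogously for no.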
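The documented behavior can be checked directly. Below is a small sketch, assuming a hypothetical helper draw() that counts its own invocations, showing that the yes argument is evaluated a single time and then recycled:

```r
calls <- 0L

# Hypothetical helper that counts how often it is invoked
draw <- function() {
  calls <<- calls + 1L
  sample(1:10, 1)
}

x <- c(1, 1, 1, 2)
res <- ifelse(x == 1, draw(), 99)

calls             # 1: draw() ran once, not once per TRUE element
unique(res[1:3])  # a single value, recycled across the three matches
```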
Answer 2 (score: 0)
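You will need to take care of Inf. From your example, I assume that you want to add +5 to the lower bound when Inf occurs. So, based on that assumption, we can do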
sapply(strsplit(d$AgeNew, '_'), function(i){
sample(i[1]:replace(i[2], i[2] == 'Inf', as.numeric(i[1]) + 5), 1)
})
#[1] 60 32 19
Note: for testing, I changed the first entry of AgeNew to 60_Inf
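If this is needed in more than one place, the same idea could be wrapped in a function. The sketch below is only an illustration: the name sample_range and the inf_span argument are made up here, and the default of 5 simply mirrors the +5 assumption above. It uses sample.int on the width of the range so that a one-value range such as "5_5" is not misread by sample as 1:5.

```r
# Hypothetical helper: draw one random integer per "low_high" label.
# inf_span (assumed default: 5) is the width used when the upper bound is "Inf".
sample_range <- function(labels, inf_span = 5) {
  vapply(strsplit(labels, "_", fixed = TRUE), function(p) {
    lo <- as.numeric(p[1])
    hi <- if (p[2] == "Inf") lo + inf_span else as.numeric(p[2])
    # Pick an offset within the range width, then shift it by the lower bound
    lo + sample.int(hi - lo + 1, 1) - 1
  }, numeric(1))
}

sample_range(c("60_Inf", "25_32", "15_20"))
```

With a data.table this could be used as d[, AgeNew := sample_range(AgeNew)], provided every label follows the low_high pattern.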