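I don't know why this isn't working. I have tried it in various ways and it just doesn't work. The if statements I wrote myself don't throw an error, but they don't apply correctly either.
Basically, there is a column Data$Age and a column Data$Age2.
If Data$Age has a value of 50 - 100, I want Data$Age2 to say "50 - 100 years" for that particular row.
Likewise, if Data$Age is 25-50, I want Data$Age2 to say "25-50 years" for the rows it applies to.
What would be the cleanest way to do this in R?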
Answer 0 (score: 2)
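dplyr probably has the cleanest solution.
Using Len Greski's sample data...
{{1}}
This assumes you only want two values for the column. ifelse() is not efficient for more than two categories, say 100. If that is not the case, I will have to think of another approach.
Edit: Or, as Len mentioned in the comments below:
data %>%
  mutate(Age2 = cut(Age1, c(24, 50, 100), c("25-50 years", "51-100 Years")))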
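As a minimal sketch of how the cut() call in this answer behaves at the boundaries, the following uses base R only. The sample ages are made up for illustration, and the column is named Age as in the question (the answer's snippet calls it Age1).

```r
# Hypothetical sample data; the column name Age follows the question.
Data <- data.frame(Age = c(24, 25, 50, 51, 100, 101))

# Intervals are right-closed by default: (24, 50] and (50, 100].
Data$Age2 <- cut(Data$Age,
                 breaks = c(24, 50, 100),
                 labels = c("25-50 years", "51-100 years"))

# cut() returns a factor; convert it if a character column is wanted.
Data$Age2 <- as.character(Data$Age2)
print(Data)
```

Ages outside all intervals (24 and 101 here) come back as NA rather than an empty string, which differs from the ifelse() and subsetting variants elsewhere in this thread.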
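Answer 1 (score: 1)
So far, all answers, posted first by Len Greski and InfiniteFlashChess, suggest using repeated subsetting statements or repeated calls to ifelse() for each of the age ranges. The cut() function has also been suggested.
Here, I would suggest another, data-driven solution. It uses a lookup table that contains the lower and upper bounds of the age ranges together with the associated labels, and it updates in a non-equi join. This allows us to specify an arbitrary number of ranges without any changes to the code: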
library(data.table)
# define lookup table
lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
   lower upper        label
1:    25    50  25-50 Years
2:    51   100 51-100 Years
# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))
# update in non-equi join
setDT(Data)[lookup, on = .(Age >= lower, Age <= upper), Age2 := label][]
    Age         Age2
 1:  24           NA
 2:  25  25-50 Years
 3:  26  25-50 Years
 4:  49  25-50 Years
 5:  50  25-50 Years
 6:  51 51-100 Years
 7:  52 51-100 Years
 8: 100 51-100 Years
 9: 101           NA
10: 102           NA
Note that NA marks ages that fall into gaps in the age ranges defined in the lookup table.
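To illustrate the point that more ranges need no code changes, here is a small sketch that adds one row to the lookup table and reuses the join expression unchanged. The extra 0-24 range is a made-up example and is not part of the original answer.

```r
library(data.table)

# Lookup table with a hypothetical third range (0-24) added;
# the other two rows are the ranges used in the answer.
lookup <- data.table(
  lower = c(0L, 25L, 51L),
  upper = c(24L, 50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)]

# Same sample ages and same join expression as in the answer.
Data <- data.table(Age = c(24:26, 49:52, 100:102))
Data[lookup, on = .(Age >= lower, Age <= upper), Age2 := label]
print(Data)
```

Only the data in lookup changed. Age 24 now gets a label, while 101 and 102 remain NA because no range covers them.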
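InfiniteFlashChess asked for benchmark results.
Any benchmark depends on the number of rows in Data as well as on the number of groups, i.e., age ranges. So, we will run benchmarks for 100 and 1M rows, with 2 groups (as specified by the OP) and with 8 groups.
Benchmark code for 2 groups: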
library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))
lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years",
                           ifelse(between(Age, 51, 100), "51 - 100 Years",
                                  "")))
  },
  cut = {
    copy(Data0) %>%
      mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
  },
  baseR = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 51:100] <- "51 - 100 years"
    data$age2[data$Age %in% 25:50] <- "25 - 50 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on = .(Age >= lower, Age <= upper), Age2 := label]
  },
  times = 100L
)
Benchmark results for 100 rows:
Unit: microseconds
    expr      min       lq       mean    median       uq       max neval cld
  ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672   100   c
     cut 2272.280 2407.455 2716.67432 2537.3425 2827.135  7351.495   100   c
   baseR   57.016   83.446   94.80729   91.1865  106.667   164.248   100 a
 join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939  2803.159   100  b
Benchmark results for 1M rows:
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
  ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566    10   c
     cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061    10  b
   baseR  52.96059  59.36964  76.09431  62.19055  66.32506 198.73654    10 a
 join_dt  66.89256  67.61147  73.33428  72.55457  78.18675  81.69146    10 a
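The benchmark code shown sets n_row <- 1E2L and times = 100L, which matches the 100-row run only. The exact setup for the 1M-row run is not shown. The sketch below is an assumption: it scales n_row up and takes the repetition count from the neval column (10) of the 1M-row results. The name n_times is made up for illustration.

```r
# Assumed setup for the 1M-row run; not shown in the original answer.
n_row <- 1E6L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))

# Assumed repetition count, matching neval = 10 in the 1M-row results;
# it would be passed as times = n_times to microbenchmark().
n_times <- 10L

str(Data0)
```

Ages are drawn from 1 to 105, so some values fall outside every range and exercise the "no match" path of each method.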
The benchmark for 8 groups requires writing out nested ifelse() calls or repeated subsetting operations:
breaks <- seq(20, 100, 10)
lookup <- data.table(
  lower = head(breaks, -1L),
  upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]
microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(
        Age2 = ifelse(
          between(Age, 21, 30), "21 - 30 Years", ifelse(
            between(Age, 31, 40), "31 - 40 Years", ifelse(
              between(Age, 41, 50), "41 - 50 Years", ifelse(
                between(Age, 51, 60), "51 - 60 Years", ifelse(
                  between(Age, 61, 70), "61 - 70 Years", ifelse(
                    between(Age, 71, 80), "71 - 80 Years", ifelse(
                      between(Age, 81, 90), "81 - 90 Years", ifelse(
                        between(Age, 91, 100), "91 - 100 Years", "")))))))))
  },
  cut = {
    copy(Data0) %>%
      mutate(Age2 = cut(Age, breaks))
  },
  subset = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 21:30] <- "21 - 30 years"
    data$age2[data$Age %in% 31:40] <- "31 - 40 years"
    data$age2[data$Age %in% 41:50] <- "41 - 50 years"
    data$age2[data$Age %in% 51:60] <- "51 - 60 years"
    data$age2[data$Age %in% 61:70] <- "61 - 70 years"
    data$age2[data$Age %in% 71:80] <- "71 - 80 years"
    data$age2[data$Age %in% 81:90] <- "81 - 90 years"
    data$age2[data$Age %in% 91:100] <- "91 - 100 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on = .(Age > lower, Age <= upper), Age2 := label]
  },
  times = 100L
)
Benchmark results for 100 rows:
Unit: microseconds
    expr      min       lq      mean    median        uq      max neval  cld
  ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748   100    d
     cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627   100   c
  subset  174.820  199.741  219.6505  210.5015  231.4575  402.501   100 a
 join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500   100  b
Benchmark results for 1M rows:
Unit: milliseconds
    expr       min         lq       mean     median         uq        max neval cld
  ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891    10   c
     cut  220.3553  221.53939  243.49476  222.82165  230.06289  406.57277    10  b
  subset  176.0096  177.92958  199.13398  184.26878  192.60274  323.90338    10  b
 join_dt   62.7471   64.26875   67.94099   65.07508   75.03169   75.38813    10 a
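In the 8-group benchmark, cut(Age, breaks) is called without labels, so it returns R's default interval labels instead of the "21-30 Years" style used by the other methods. As a sketch under that observation, the same sprintf() pattern used for the lookup table can generate labels for cut(). The sample ages below are made up for illustration.

```r
breaks <- seq(20, 100, 10)

# Build one label per interval from the break points, using the same
# pattern as the lookup table: lower bound + 1 up to the upper bound.
labels <- sprintf("%i-%i Years", head(breaks, -1L) + 1L, tail(breaks, -1L))

# Hypothetical ages around the interval boundaries.
Age <- c(20, 21, 30, 31, 100, 101)
Age2 <- as.character(cut(Age, breaks = breaks, labels = labels))

print(data.frame(Age, Age2))
```

Because the labels are derived from breaks, changing the number of groups only requires changing breaks, much like editing the lookup table in the join approach.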
Answer 2 (score: 0)
Here is a solution using base R. Note that since age2 can't be both 25 - 50 and 50 - 100 at the same time, I made the categories mutually exclusive:
data <- data.frame(age = round(runif(100)*100,0),
                   age2 = rep(" ",100),stringsAsFactors=FALSE)
data$age2[data$age %in% 51:100] <- "51 - 100 years"
data$age2[data$age %in% 25:50] <- "25 - 50 years"
data[1:15,]
...and the output:
> data[1:15,]
age age2
1 0
2 45 25 - 50 years
3 58 51 - 100 years
4 59 51 - 100 years
5 84 51 - 100 years
6 79 51 - 100 years
7 5
8 78 51 - 100 years
9 46 25 - 50 years
10 6
11 73 51 - 100 years
12 37 25 - 50 years
13 5
14 41 25 - 50 years
15 58 51 - 100 years
>
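One caveat worth sketching: %in% tests exact membership in the integer sequences 25:50 and 51:100, which works above because the ages are rounded. If a column held fractional ages (an assumption, the question does not say), a value such as 50.5 would match neither range. The variable names by_in and by_range below are made up for illustration.

```r
# Hypothetical ages, including one non-integer value.
age <- c(37, 50, 50.5, 51, 100)

# Exact membership, as in the answer: 50.5 matches neither sequence.
by_in <- rep(" ", length(age))
by_in[age %in% 51:100] <- "51 - 100 years"
by_in[age %in% 25:50] <- "25 - 50 years"

# Range comparisons cover fractional values as well.
by_range <- rep(" ", length(age))
by_range[age > 50 & age <= 100] <- "51 - 100 years"
by_range[age >= 25 & age <= 50] <- "25 - 50 years"

print(data.frame(age, by_in, by_range))
```

For whole-number ages the two variants agree, so the answer's code is unaffected as written.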