Creating a conditional column in an R data.frame

Time: 2015-10-05 17:50:50

Tags: r if-statement dataframe conditional

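My data is shown below. I want a new column called accuracy_level. How can I do this? I tried if, but it did not work.

The rules are:

  • if accuracy_percentage is within +/- 10%, accuracy_level should be "Good"
  • if accuracy_percentage is within +/- 30% but outside +/- 10%, accuracy_level should be "Bad"
  • if accuracy_percentage is outside +/- 30%, accuracy_level should be "Worst"

Here is my code: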

actuals=seq(0,10,0.1)
forecast=seq(10,0,-0.1)
data1=data.frame(actuals,forecast)
data1$diff=data1$actuals-data1$forecast
data1$accuracy_percentage=(data1$diff/data1$actuals)*100
if((data1$accuracy_percentage < 10)&(data1$accuracy_percentage > -10),data1$accuracy_level="good",)

2 answers:

Answer 0 (score: 3)

data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T)
#    actuals forecast diff accuracy_percentage accuracy_level
# 19     1.8      8.2 -6.4          -355.55556          Worst
# 71     7.0      3.0  4.0            57.14286          Worst
# 57     5.6      4.4  1.2            21.42857            Bad
# 17     1.6      8.4 -6.8          -425.00000          Worst
# 92     9.1      0.9  8.2            90.10989          Worst
# 91     9.0      1.0  8.0            88.88889          Worst
# 13     1.2      8.8 -7.6          -633.33333          Worst
# 79     7.8      2.2  5.6            71.79487          Worst
# 44     4.3      5.7 -1.4           -32.55814          Worst
# 51     5.0      5.0  0.0             0.00000           Good

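Using cut gives both speed and scalability. We take the absolute value of the accuracy percentages with abs and bin it into the intervals defined by the cut points c(0, 10, 30, Inf), then supply labels for the groups. We also add the include.lowest=TRUE argument for values of 0.000, which fall exactly on the lower bound of our cut points.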
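To make the role of the cut points and include.lowest concrete, here is a minimal sketch on a toy vector. The vector x and the names lvl and lvl_open are made up for illustration and are not part of the original data.

```r
# Toy absolute percentages, including the boundary values 0, 10 and 30
x <- c(0, 5, 10, 10.5, 30, 45, Inf)

# Intervals are [0,10], (10,30], (30,Inf]; include.lowest = TRUE closes the first one at 0
lvl <- cut(x, breaks = c(0, 10, 30, Inf),
           labels = c("Good", "Bad", "Worst"),
           include.lowest = TRUE)

# Without include.lowest the first interval is (0,10], so an exact 0 becomes NA
lvl_open <- cut(x, breaks = c(0, 10, 30, Inf),
                labels = c("Good", "Bad", "Worst"))

print(data.frame(x, lvl, lvl_open))
```

Because cut uses right-closed intervals by default, a value of exactly 10 is labelled "Good" and exactly 30 is labelled "Bad".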

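Nested ifelse statements are popular because they are easy to follow when read aloud. But if you have to nest 10 different conditions, they quickly get out of hand.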

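As a side note, if we did not need new label names, we could use the related function findInterval, which does essentially the same thing but returns integer values as output (i.e. {{1}}).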
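A small sketch of the findInterval variant, for comparison. The toy vector x and the choice of break points c(0, 10, 30) are assumptions made for this illustration, not taken from the answer.

```r
# Toy absolute percentages (illustrative values only)
x <- c(0, 5, 10, 10.5, 30, 45)

# findInterval returns the index of the interval each value falls into.
# By default intervals are left-closed, [0,10), [10,30), [30,Inf),
# so boundary values land in the upper bin, unlike cut's default.
idx <- findInterval(x, c(0, 10, 30))
print(idx)  # 1 1 2 2 3 3

# The integer codes can still be mapped to labels by indexing if needed
print(c("Good", "Bad", "Worst")[idx])
```

Note the boundary difference: here exactly 10 maps to bin 2, whereas the cut call above puts it in "Good".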

Answer 1 (score: 2)

I used a compound ifelse:

data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
                                  ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst"))

which yields

> head(data1)
  actuals forecast  diff accuracy_percentage accuracy_category
1     0.0     10.0 -10.0                -Inf             Worst
2     0.1      9.9  -9.8           -9800.000             Worst
3     0.2      9.8  -9.6           -4800.000             Worst
4     0.3      9.7  -9.4           -3133.333             Worst
5     0.4      9.6  -9.2           -2300.000             Worst
6     0.5      9.5  -9.0           -1800.000             Worst

As @pierre-lafortune points out, this is easier to read but less performant. In the spirit of Knuth, I ran some tests. With the initial setup:

> system.time(data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
+ ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst")))
   user  system elapsed 
      0       0       0 
> system.time(data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T))
   user  system elapsed 
  0.000   0.000   0.001

But that does not tell us much at this size. So let's scale it up :) With

actuals=seq(0,100000,0.1)
forecast=seq(100000,0,-0.1)

I get

> system.time(data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
+ ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst")))
   user  system elapsed 
  0.776   0.060   0.840 
> system.time(data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T))
   user  system elapsed 
  0.152   0.003   0.155 

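This does show that cut is more efficient as the data scales. All that said, cut is the more elegant solution, if less readable, and I upvoted his answer :) YMMV.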
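For anyone who wants to reproduce the larger comparison, here is a self-contained sketch. It assumes data1 is rebuilt from the longer vectors using the same steps as in the question (that rebuild step is not shown in the answer), and the names t_ifelse and t_cut are made up here. Timings will vary by machine and R version, so no expected numbers are given.

```r
# Rebuild the data with the longer vectors (assumed: same steps as in the question)
actuals  <- seq(0, 100000, 0.1)
forecast <- seq(100000, 0, -0.1)
data1 <- data.frame(actuals, forecast)
data1$diff <- data1$actuals - data1$forecast
data1$accuracy_percentage <- (data1$diff / data1$actuals) * 100

# Time the nested ifelse approach
t_ifelse <- system.time(
  data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage) < 10, "Good",
                             ifelse(abs(data1$accuracy_percentage) < 30, "Bad", "Worst"))
)

# Time the cut approach
t_cut <- system.time(
  data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf),
                              c("Good", "Bad", "Worst"), include.lowest = TRUE)
)

print(t_ifelse)
print(t_cut)
```

A single system.time call is a rough measure; repeating each timing several times gives a steadier comparison.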