Variable creation - inferring age

Time: 2020-11-12 16:17:40

Tags: r variables inference feature-engineering data-wrangling

I have a grouped data frame:

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,40000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Truck OilChanged Odometer
# 1      A       True     1000
# 2      A     NewOil     1000
# 3      A      False     2000
# 4      A      False     3000
# 5      B      False      700
# 6      B      False      800
# 7      B      False      900
# 8      B      False     1000
# 9      C       True    20000
# 10     C     NewOil    20000
# 11     C       True    30000
# 12     C     NewOil    40000

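I want to infer the age of the oil (in kilometres) wherever possible. The age can only be inferred after an oil change. If the oil has never been changed, its age remains unknown (for example, Truck B).

Here is the desired result: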

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000,700,800,900,1000,20000,20000,30000,40000)
OilAge <- c(NA,0,1000,2000,NA,NA,NA,NA,NA,0,10000,0)
Result <- data.frame(Truck, OilChanged, Odometer, OilAge)


# Truck OilChanged Odometer OilAge
# 1      A       True     1000     NA
# 2      A     NewOil     1000      0
# 3      A      False     2000   1000
# 4      A      False     3000   2000
# 5      B      False      700     NA
# 6      B      False      800     NA
# 7      B      False      900     NA
# 8      B      False     1000     NA
# 9      C       True    20000     NA
# 10     C     NewOil    20000      0
# 11     C       True    30000  10000
# 12     C     NewOil    40000      0

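Note: the odometer reading on a True row (oil changed) and on the NewOil row that immediately follows it will always be the same, because the oil was sampled directly before it was changed. Both rows must be kept, however, so that downstream calculations such as rate-of-change formulas work correctly.

NA in the OilAge column means the age is unknown.

2 answers:

Answer 0 (score: 1)

Please let me know whether this solution works for you.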

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

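library(dplyr)

DF %>%
  group_by(Truck) %>%
  # status = number of distinct OilChanged values in the truck (1 for Truck B, which is all "False")
  mutate(status = length(unique(OilChanged)),
         OilAge = ifelse(OilChanged == "NewOil", 0,
                         # this branch simplifies to lag(Odometer), the previous reading
                         ifelse(OilChanged == "False", Odometer - (Odometer - lag(Odometer)),
                                ifelse(OilChanged == "True", Odometer - lag(Odometer), NA)))) %>%
  # trucks with a single OilChanged value get NA throughout
  mutate(OilAge = ifelse(status !=1, OilAge, NA)) %>%
  # drop the helper column status
  subset(select = c(Truck, OilChanged, Odometer, OilAge))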
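
In the "False" branch above, Odometer - (Odometer - lag(Odometer)) reduces to lag(Odometer). That equals the expected age for Truck A only because its oil change happened at 1000 km and the readings rise in 1000 km steps. Below is a minimal base R sketch of a more general variant, assuming rows are already sorted by odometer within each truck. infer_oil_age is a hypothetical helper name and not part of the original answer.

```r
# Sample data as used in this answer
Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000, 700, 800, 900, 1000, 20000, 20000, 30000, 30000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Hypothetical helper: oil age for ONE truck, rows assumed sorted by odometer
infer_oil_age <- function(changed, odo) {
  is_new <- changed == "NewOil"
  # Index of the most recent NewOil row at or before each row (0 if none yet)
  last_new <- cummax(ifelse(is_new, seq_along(odo), 0L))
  # Odometer reading at that oil change; NA while no change has been seen
  base <- rep(NA_real_, length(odo))
  base[last_new > 0] <- odo[last_new[last_new > 0]]
  odo - base
}

# Apply the helper truck by truck
DF$OilAge <- NA_real_
for (tr in unique(DF$Truck)) {
  rows <- DF$Truck == tr
  DF$OilAge[rows] <- infer_oil_age(DF$OilChanged[rows], DF$Odometer[rows])
}
DF

# Irregular steps: age is measured from the change at 5000, not from the previous row
infer_oil_age(c("True", "NewOil", "False", "False"), c(5000, 5000, 5300, 5900))
# NA 0 300 900
```

The loop keeps the sketch free of package dependencies; the same helper could be called inside group_by(Truck) %>% mutate(...) if dplyr is already loaded.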

Answer 1 (score: 1)

Another approach

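library(dplyr)

DF %>% group_by(Truck)  %>%
  # d counts the oil changes seen so far within each truck (0 = none yet)
  mutate(d = cumsum(OilChanged == 'NewOil')) %>%
  group_by(Truck, d) %>%
  # NA^TRUE is NA and NA^FALSE is 1, so groups with d == 0 come out as NA
  mutate(OilAge = cumsum(c(0*NA^(as.logical(!(first(d)))), diff(NA^(as.logical(!d))*Odometer))))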

# A tibble: 12 x 5
# Groups:   Truck, d [6]
#    Truck OilChanged Odometer     d OilAge
#    <chr> <chr>         <dbl> <int>  <dbl>
#  1 A     True           1000     0     NA
#  2 A     NewOil         1000     1      0
#  3 A     False          2000     1   1000
#  4 A     False          3000     1   2000
#  5 B     False           700     0     NA
#  6 B     False           800     0     NA
#  7 B     False           900     0     NA
#  8 B     False          1000     0     NA
#  9 C     True          20000     0     NA
# 10 C     NewOil        20000     1      0
# 11 C     True          30000     1  10000
# 12 C     NewOil        30000     2      0

d is a helper variable; you can deselect it once you understand what the code does.
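
The NA^ expressions are the least obvious part of this answer. The sketch below, in base R only, shows how they behave on a single (Truck, d) group and compares them with a plainer formulation. The helper names trick_age and plain_age are hypothetical and not part of the original answer.

```r
# In R, x^0 is 1 for every x (including NA), while NA^1 stays NA.
NA^FALSE  # FALSE is coerced to 0, so this is 1
NA^TRUE   # TRUE is coerced to 1, so this is NA

# The answer's expression, applied to one (Truck, d) group
trick_age <- function(d, odo) {
  cumsum(c(0 * NA^(!d[1]), diff(NA^(!d) * odo)))
}

# A plainer equivalent: NA before the first oil change (d == 0),
# otherwise the distance travelled since the first row of the group
plain_age <- function(d, odo) {
  if (d[1] == 0) rep(NA_real_, length(odo)) else odo - odo[1]
}

trick_age(c(1, 1, 1), c(1000, 2000, 3000))        # 0 1000 2000
plain_age(c(1, 1, 1), c(1000, 2000, 3000))        # 0 1000 2000
trick_age(c(0, 0, 0, 0), c(700, 800, 900, 1000))  # NA NA NA NA
plain_age(c(0, 0, 0, 0), c(700, 800, 900, 1000))  # NA NA NA NA
```

When dropping d from the final tibble, keep in mind that it is still a grouping variable at that point, so ungroup() has to come before select(-d).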