Using the following data frame:
text <- "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)
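For every location_id, I need to calculate the difference between each brand's values and Tesla's values for count, driven_km, efficiency, mileage and age. The difference should be computed as Value for i - Value for Tesla, where i = {"Toyota", "Honda", "Nissan", ...}. There are location_ids where values for Tesla may not be present, or where only values for Tesla are present; these need to be ignored, because the difference is not meaningful for those location_ids.
I am looking for an elegant way to do this, preferably with dplyr.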
Expected output:
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,-8,-817,-0.08,206,1.79
23040204995,Honda,7,-1271,-0.27,-9.5,0.07
23040204996,Toyota,-8,2686,0.33,-48,-0.1
23040204996,Nissan,14,1612,0.05,388.5,0.26
Answer 0 (score: 3)
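Using data.table, grouped by 'location_id', we specify the columns to take the difference of in .SDcols, and get the difference by looping over the Subset of Data.table (.SD):
library(data.table)
# for each location, subtract the Tesla value from every other brand's value
setDT(df)[, lapply(.SD, function(x) x[brand != "Tesla"] -
    x[brand == "Tesla"]), location_id, .SDcols = count:age]
If the corresponding 'brand' column is also needed:
# keep 'brand', return NA where a location has no Tesla row, then drop the Tesla and NA rows
setDT(df)[, c(list(brand = brand), lapply(.SD, function(x) if("Tesla" %in% brand)
    as.numeric(x - x[brand == "Tesla"]) else NA_real_)), location_id, .SDcols = count:age
    ][brand != "Tesla" & !is.na(count)]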
# location_id brand count driven_km efficiency mileage age
#1: 23040204995 Toyota -8 -817 -0.08 206.0 1.79
#2: 23040204995 Honda 7 -1271 -0.27 -9.5 0.07
#3: 23040204996 Toyota -8 2686 0.33 -48.0 -0.10
#4: 23040204996 Nissan 14 1612 0.05 388.5 0.26
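The first data.table snippet drops locations that lack a Tesla row, or that contain only a Tesla row, without any explicit filter. A minimal base R sketch of why, assuming a hypothetical helper diff_vs_tesla() that is not part of the answer: arithmetic with a zero-length vector returns a zero-length vector, so such groups contribute no rows.

```r
# Hypothetical helper (not from the answer): difference of every
# non-Tesla value against the Tesla value within one location.
diff_vs_tesla <- function(x, brand) {
  x[brand != "Tesla"] - x[brand == "Tesla"]
}

# Location with Tesla and two other brands: one difference per other brand.
both <- diff_vs_tesla(c(8, 23, 16), c("Toyota", "Honda", "Tesla"))

# Location with only a Tesla row: the left-hand side is empty.
only_tesla <- diff_vs_tesla(11, "Tesla")

# Location without a Tesla row: the right-hand side is empty.
no_tesla <- diff_vs_tesla(24, "Honda")

length(both)        # 2
length(only_tesla)  # 0
length(no_tesla)    # 0
```

This relies on each location having at most one Tesla row; with several Tesla rows the subtraction would recycle values and give misleading results.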
The same result can also be obtained with the tidyverse.
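A minimal dplyr sketch of one way to do this, not the answerer's original code. It assumes dplyr 1.0.0 or later (for across()) and at most one Tesla row per location_id; the name result is only for illustration.

```r
library(dplyr)

df <- read.csv(text = "location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48")

result <- df %>%
  group_by(location_id) %>%
  # keep only locations that have a Tesla row (assumed to be unique)
  filter(any(brand == "Tesla")) %>%
  # subtract the location's Tesla value from every brand's value
  mutate(across(count:age, ~ .x - .x[brand == "Tesla"])) %>%
  # drop the Tesla rows; this also removes Tesla-only locations
  filter(brand != "Tesla") %>%
  ungroup()

result
```

Filtering on the Tesla row before the mutate() keeps the subtraction well defined in every remaining group, and the final filter() handles the Tesla-only locations without a separate check.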
Answer 1 (score: 2)
I would do this with tidyr and dplyr, something like this:
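library(dplyr)
library(tidyr)
# reshape to long format: one row per location, brand and measure
dfl <- gather(df, "key", "value", -location_id, -brand)
# split into the Tesla rows and the rows of all other brands
dflt <- dfl %>% filter(brand == "Tesla")
dfln <- dfl %>% filter(brand != "Tesla")
# the inner join keeps only locations that have both Tesla and another brand
inner_join(dflt, dfln, by = c("location_id", "key")) %>%
  mutate(value = value.y - value.x) %>%
  select(location_id, brand = brand.y, key, value) %>%
  spread(key, value)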
# location_id brand age count driven_km efficiency mileage
# 1 23040204995 Honda 0.07 7 -1271 -0.27 -9.5
# 2 23040204995 Toyota 1.79 -8 -817 -0.08 206.0
# 3 23040204996 Nissan 0.26 14 1612 0.05 388.5
# 4 23040204996 Toyota -0.10 -8 2686 0.33 -48.0
The columns are in a different order, but you can rearrange them.
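A short sketch of that rearrangement with dplyr's select(). The data frame res below is a hypothetical stand-in for the first two rows of the output above, and res_ordered is an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for the spread() result, columns in alphabetical order.
res <- data.frame(
  location_id = c(23040204995, 23040204995),
  brand       = c("Honda", "Toyota"),
  age         = c(0.07, 1.79),
  count       = c(7, -8),
  driven_km   = c(-1271, -817),
  efficiency  = c(-0.27, -0.08),
  mileage     = c(-9.5, 206)
)

# Put the columns back in the order used in the question.
res_ordered <- res %>%
  select(location_id, brand, count, driven_km, efficiency, mileage, age)

names(res_ordered)
```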