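I'm trying to find a clean, efficient way to create a new variable from a fairly complex calculation on 5 existing variables. My problem is that one of the variables is a factor and the others contain NAs.
I have a dataset containing several groups of variables, structured as follows:
For each row/observation, only one of the 4 integer fields has a numeric value, depending on the value of expenditure_period; the rest are NA.
For example: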
  expenditure_period expenditure1 expenditure2 expenditure3 expenditure4
1            monthly           NA           NA            5           NA
2             weekly           NA            5           NA           NA
3            monthly           NA           NA            2           NA
4            monthly           NA           NA            5           NA
5            monthly           NA           NA           58           NA
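I want to create a new variable containing the standardized monthly expenditure. So if expenditure_period is daily, then expenditure1 * 30. If weekly, then expenditure2 * 4. If monthly, then expenditure3 * 1. If yearly, then expenditure4 / 12.
The best solution I could come up with is the following mess: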
data$expenditure_factor[data$expenditure_period=="daily"] <- 30
data$expenditure_factor[data$expenditure_period=="weekly"] <- 4
data$expenditure_factor[data$expenditure_period=="monthly"] <- 1
data$expenditure_factor[data$expenditure_period=="yearly"] <- 1/12
data$expenditure_month <- apply(data[,c("expenditure1", "expenditure2",
"expenditure3", "expenditure4", "expenditure_factor")], 1,
function(x) { sum(x[1:4], na.rm=TRUE) * x[5]} )
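I tried adding expenditure1, 2, 3 and 4 together with the + operator, but because that adds 1 number to 3 NAs, the result was NA for every row. I tried creating a temporary variable using the sum function with na.rm, but that gave the same total for every row. I tried mutate from the dplyr package, without success.
Is there a simpler, more elegant way? I have to do the same for about 12 different expenditure categories. I apologize if this has been asked before; I couldn't find a similar post. If there is one, please point me to it.
I'm using RStudio with R 3.2.3 on Windows 7.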
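To make the two failures described above concrete, here is a minimal sketch on a toy data frame built from the first three example rows (`toy`, `cols` and the result names are made up for illustration). It shows `+` propagating NA, `sum()` collapsing everything into one scalar that is then recycled to every row, and `rowSums()` with `na.rm = TRUE` as one possible per-row alternative.

```r
# Toy data mirroring the first three rows of the example (names are illustrative)
toy <- data.frame(
  expenditure_period = c("monthly", "weekly", "monthly"),
  expenditure1 = c(NA_real_, NA_real_, NA_real_),
  expenditure2 = c(NA, 5, NA),
  expenditure3 = c(5, NA, 2),
  expenditure4 = c(NA_real_, NA_real_, NA_real_)
)
cols <- paste0("expenditure", 1:4)

# `+` is vectorised but has no na.rm: any NA in a row makes that row NA
plus_result <- toy$expenditure1 + toy$expenditure2 +
  toy$expenditure3 + toy$expenditure4

# sum() pools all of its arguments into one grand total (a single number),
# which is recycled to every row when assigned to a column
sum_result <- sum(toy$expenditure1, toy$expenditure2,
                  toy$expenditure3, toy$expenditure4, na.rm = TRUE)

# rowSums() works row by row and can skip NAs
row_totals <- rowSums(toy[, cols], na.rm = TRUE)
```

One caveat with this sketch: a row in which all four fields are NA gets a total of 0 rather than NA, which may or may not be what is wanted.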
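Answer 0: (score: 0)
"Clean, efficient" is a matter of opinion, but the following will be very easy to maintain and understand when you come back to the code after not looking at it for a while. It keeps the data in separate tables, does one thing at a time, and can be inspected between steps.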
# conversion table to replace bulk of mess with slightly better mess of code that is easy to inspect
expenditure_factor <- data.frame(expenditure_period = c('daily','weekly','monthly','yearly'),
pfactor = c(30,4,1,1/12),
stringsAsFactors = F)
# sum total expenditure (expenditurex) and remove extra columns
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = T)
data$expenditure1 <- data$expenditure2 <- data$expenditure3 <- data$expenditure4 <- NULL
# add factor from conversion table
data <- merge(data,expenditure_factor,by = 'expenditure_period',all.x = T)
# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$pfactor
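One thing to watch with the merge step: `merge()` sorts the result by the key column by default, so the rows do not come back in their original order. A minimal sketch of one way to restore the order, assuming a helper column `row_id` (a made-up name, not part of the original data):

```r
# Toy inputs; `toy` and `row_id` are illustrative names
toy <- data.frame(
  expenditure_period = c("monthly", "weekly", "daily"),
  sumexpenditure = c(5, 5, 2),
  stringsAsFactors = FALSE
)
expenditure_factor <- data.frame(
  expenditure_period = c("daily", "weekly", "monthly", "yearly"),
  pfactor = c(30, 4, 1, 1 / 12),
  stringsAsFactors = FALSE
)

# Remember the original row order before merging
toy$row_id <- seq_len(nrow(toy))

merged <- merge(toy, expenditure_factor, by = "expenditure_period", all.x = TRUE)
order_after_merge <- merged$expenditure_period  # sorted by key, not original order

# Restore the original order, then compute the monthly value
merged <- merged[order(merged$row_id), ]
merged$expenditure_month <- merged$sumexpenditure * merged$pfactor
```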
Alternatively, the lookup can be collapsed into a one-liner.
Assuming expenditure_period is a character variable:
data$expenditure_period <- as.character(data$expenditure_period)
Then:
# sum total expenditure
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = T)
# use an index
data$expenditure_factor <- c(30,4,1,1/12)[match(data$expenditure_period,c('daily','weekly','monthly','yearly'))]
# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$expenditure_factor
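Since the question mentions about 12 expenditure categories, the same idea could be wrapped in a small function. The following is only a sketch under assumptions: `monthly_total` and `period_factor` are hypothetical names, and it assumes every category follows the naming pattern `<prefix>_period` and `<prefix>1` to `<prefix>4` seen in the example.

```r
# Named lookup vector: period -> monthly multiplier (values from the question)
period_factor <- c(daily = 30, weekly = 4, monthly = 1, yearly = 1 / 12)

# Hypothetical helper: `prefix` is the category's column prefix, e.g.
# "expenditure" for expenditure1..expenditure4 and expenditure_period
monthly_total <- function(df, prefix, period_col = paste0(prefix, "_period")) {
  value_cols <- paste0(prefix, 1:4)
  # Only one of the four fields is filled per row, so the row sum is that value
  total <- rowSums(df[, value_cols], na.rm = TRUE)
  # as.character() makes the name lookup work when the period is a factor
  unname(total * period_factor[as.character(df[[period_col]])])
}

# Toy data; the period column is deliberately a factor
toy <- data.frame(
  expenditure_period = factor(c("monthly", "weekly", "yearly")),
  expenditure1 = c(NA_real_, NA_real_, NA_real_),
  expenditure2 = c(NA, 5, NA),
  expenditure3 = c(5, NA, NA),
  expenditure4 = c(NA, NA, 120)
)
toy$expenditure_month <- monthly_total(toy, "expenditure")
```

With such a helper, the remaining categories could be handled by looping over their prefixes and calling `monthly_total()` once per prefix, provided the columns really do follow the assumed naming pattern.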
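Answer 1: (score: 0)
Well, this might be a somewhat unconventional approach, but what if you rename the columns so that they contain the multipliers, reshape the data, and extract the multiplier to calculate the new variable?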
library(dplyr)
library(tidyr)
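# New cols: the suffix after "expenditure." is the monthly multiplier
data <- rename(data, expenditure.30 = expenditure1,
               expenditure.4 = expenditure2,
               expenditure.1 = expenditure3,
               `expenditure.1/12` = expenditure4)
# Reshape and calculate new col
data %>% gather(exp_new, exp_val, expenditure.30:`expenditure.1/12`) %>%
  # as.numeric("1/12") would return NA, so evaluate the suffix as an expression
  mutate(mont_exp = exp_val * vapply(sub('.*\\.', '', exp_new),
                                     function(s) eval(parse(text = s)),
                                     numeric(1), USE.NAMES = FALSE)) %>%
  na.omit()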
# expenditure_period exp_new exp_val mont_exp
#7 weekly expenditure.4 5 20
#11 monthly expenditure.1 5 5
#13 monthly expenditure.1 2 2
#14 monthly expenditure.1 5 5
#15 monthly expenditure.1 58 58