Question

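I'm trying to find a clean, efficient way to create a new variable from a complex calculation over 5 existing variables. My problem is that one variable is a factor and the others contain NAs.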

I have a dataset containing several groups of variables, each structured as follows:

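expenditure_period - factor: 1 = daily, 2 = weekly, 3 = monthly, 4 = yearly
expenditure1 - integer, amount spent daily
expenditure2 - integer, amount spent weekly
expenditure3 - integer, amount spent monthly
expenditure4 - integer, amount spent yearly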

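For each row/observation, only one of the 4 integer fields holds a numeric value, depending on the value of expenditure_period; the rest are NA.

For example: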

   expenditure_period  expenditure1  expenditure2  expenditure3  expenditure4
1             monthly            NA            NA             5            NA
2              weekly            NA             5            NA            NA
3             monthly            NA            NA             2            NA
4             monthly            NA            NA             5            NA
5             monthly            NA            NA            58            NA

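I want to create a new variable containing the standardized monthly expenditure. So if expenditure_period is daily, then expenditure1 * 30. If weekly, then expenditure2 * 4. If monthly, then expenditure3 * 1. If yearly, then expenditure4 / 12.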

The best solution I could come up with is the following mess:

data$expenditure_factor[data$expenditure_period=="daily"] <- 30
data$expenditure_factor[data$expenditure_period=="weekly"] <- 4
data$expenditure_factor[data$expenditure_period=="monthly"] <- 1
data$expenditure_factor[data$expenditure_period=="yearly"] <- 1/12
data$expenditure_month <- apply(data[,c("expenditure1", "expenditure2",
 "expenditure3", "expenditure4", "expenditure_factor")], 1, 
function(x) { sum(x[1:4], na.rm=TRUE) * x[5]} )

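I tried adding expenditure1, 2, 3 and 4 together with the + operator, but adding one number to 3 NAs gave NA for every row. I tried creating a temporary variable with the sum function and na.rm, but that gave the same total for every row. I tried mutate from the dplyr package, with no luck.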
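
For reference, a minimal sketch of why those two attempts fail, using a small data frame rebuilt by hand from the example rows above (`df` and `cols` are illustrative names only, and the `sum()` call is a guess at what was tried). `+` propagates NA, and `sum()` over several columns collapses everything into one grand total, whereas `rowSums()` with `na.rm = TRUE` works row by row:

```r
# Example rows from the question, rebuilt by hand (df is an illustrative name)
df <- data.frame(
  expenditure_period = factor(c("monthly", "weekly", "monthly", "monthly", "monthly")),
  expenditure1 = rep(NA_integer_, 5),
  expenditure2 = c(NA, 5L, NA, NA, NA),
  expenditure3 = c(5L, NA, 2L, 5L, 58L),
  expenditure4 = rep(NA_integer_, 5)
)
cols <- paste0("expenditure", 1:4)

# `+` propagates NA: a single NA operand makes the whole row NA
plus_total <- df$expenditure1 + df$expenditure2 + df$expenditure3 + df$expenditure4

# sum() returns one grand total for all cells; assigning it to a column
# recycles that single value into every row
grand_total <- sum(df[, cols], na.rm = TRUE)

# rowSums() adds across the columns within each row, skipping NA
row_total <- rowSums(df[, cols], na.rm = TRUE)
```

Here `plus_total` is NA in every position, `grand_total` is a single number, and only `row_total` holds one value per observation.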

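Is there a simpler, more elegant way? I have to do the same thing for about 12 different expenditure categories. Apologies if this has been asked before; I couldn't find a similar post. If one exists, please point me to it.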

I'm using RStudio with R 3.2.3 on Windows 7.

Answer 1

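"Clean, efficient" is a matter of opinion, but the following will be quite easy to maintain and understand when you come back to the code after a while. It keeps the data in separate tables, does one thing at a time, and can be inspected between steps.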

# conversion table: replaces the bulk of the mess with a lookup that is easy to inspect
expenditure_factor <- data.frame(expenditure_period = c('daily','weekly','monthly','yearly'),
                                 pfactor = c(30,4,1,1/12),
                                 stringsAsFactors = FALSE)

# sum total expenditure (expenditurex) and remove extra columns
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = TRUE)
data$expenditure1 <- data$expenditure2 <- data$expenditure3 <- data$expenditure4 <- NULL

# add factor from conversion table
# (merge() sorts the result by expenditure_period by default, so row order changes)
data <- merge(data,expenditure_factor,by = 'expenditure_period',all.x = TRUE)

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$pfactor

Or this could be condensed into a single lookup line.

Assuming expenditure_period is a character variable:

data$expenditure_period <- as.character(data$expenditure_period)

Then:

# sum total expenditure
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = TRUE)

# use an index
data$expenditure_factor <- c(30,4,1,1/12)[match(data$expenditure_period,c('daily','weekly','monthly','yearly'))]

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$expenditure_factor
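
Since the question mentions about 12 expenditure categories, the index approach could be wrapped in a small function. This is only a sketch: the helper name `monthly_amount()` and the column naming pattern (`<prefix>_period` plus `<prefix>1` to `<prefix>4`) are assumptions, and the yearly and daily rows in the sample data are invented to exercise every factor.

```r
# Hypothetical helper: the name and the <prefix>_period / <prefix>1..4
# column pattern are assumptions, not taken from the original data set.
monthly_amount <- function(df, prefix) {
  periods <- c("daily", "weekly", "monthly", "yearly")
  factors <- c(30, 4, 1, 1 / 12)
  amount_cols <- paste0(prefix, 1:4)
  # per-row total of the four amount columns, ignoring NA
  total <- unname(rowSums(df[, amount_cols], na.rm = TRUE))
  # as.character() makes the lookup explicit when the period is a factor
  period <- as.character(df[[paste0(prefix, "_period")]])
  total * factors[match(period, periods)]
}

# Invented sample rows covering all four periods
df <- data.frame(
  expenditure_period = factor(c("monthly", "weekly", "yearly", "daily")),
  expenditure1 = c(NA, NA, NA, 2L),
  expenditure2 = c(NA, 5L, NA, NA),
  expenditure3 = c(5L, NA, NA, NA),
  expenditure4 = c(NA, NA, 120L, NA)
)
df$expenditure_month <- monthly_amount(df, "expenditure")
```

One thing to keep in mind: with `na.rm = TRUE`, a row where all four amounts are NA yields 0 rather than NA, which may or may not be what you want.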

Answer 2

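Well, this might be a somewhat unconventional approach, but what if you rename the columns so that they contain the multipliers, reshape the data, and extract the multiplier to use in calculating the new variable?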

library(dplyr)
library(tidyr)

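# New cols: encode each column's monthly multiplier in its name
data<-rename(data, expenditure.30 = expenditure1, 
            expenditure.4 = expenditure2,
            expenditure.1 = expenditure3,
            `expenditure.1/12` = expenditure4)

# Reshape and calculate new col
# as.numeric("1/12") would give NA, so the multiplier text is evaluated instead
data %>% gather(exp_new,exp_val,expenditure.30:`expenditure.1/12`) %>% 
        mutate(mont_exp = exp_val * vapply(sub('.*\\.', '', exp_new),
                                           function(m) eval(parse(text = m)),
                                           numeric(1), USE.NAMES = FALSE)) %>%
        na.omit()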
#   expenditure_period       exp_new exp_val mont_exp
#7              weekly expenditure.4       5       20
#11            monthly expenditure.1       5        5
#13            monthly expenditure.1       2        2
#14            monthly expenditure.1       5        5
#15            monthly expenditure.1      58       58
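
A variation on the same reshape idea, sketched under the assumption that a reasonably recent tidyr (1.0 or later, which provides `pivot_longer()`) is installed: instead of encoding the multipliers in the column names, keep them in a small lookup table and join on the column name. The names `df`, `multipliers` and `exp_col` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Example rows from the question, rebuilt by hand (df is an illustrative name)
df <- data.frame(
  expenditure_period = c("monthly", "weekly", "monthly", "monthly", "monthly"),
  expenditure1 = rep(NA_integer_, 5),
  expenditure2 = c(NA, 5L, NA, NA, NA),
  expenditure3 = c(5L, NA, 2L, 5L, 58L),
  expenditure4 = rep(NA_integer_, 5),
  stringsAsFactors = FALSE
)

# Lookup table: one monthly multiplier per amount column
multipliers <- data.frame(
  exp_col = paste0("expenditure", 1:4),
  multiplier = c(30, 4, 1, 1 / 12),
  stringsAsFactors = FALSE
)

result <- df %>%
  # long format: one row per non-NA amount
  pivot_longer(expenditure1:expenditure4,
               names_to = "exp_col", values_to = "exp_val",
               values_drop_na = TRUE) %>%
  # attach the multiplier that belongs to each source column
  left_join(multipliers, by = "exp_col") %>%
  mutate(mont_exp = exp_val * multiplier)
```

This avoids parsing numbers out of column names, at the cost of one extra join.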

Better way to calculate a new variable from a complex calculation over multiple variables, some with NAs
