My data looks like this:
bankname bankid year deposit dep_cert capital surplus
Bank A 1 1881 244789 7250 20218 29513
Bank B 2 1881 195755 10243 185151 NA
Bank C 3 1881 107736 13357 177612 NA
Bank D 4 1881 170600 NA 20000 NA
Bank E 5 1881 32000000 351266 314012 NA
Here is the code to reproduce the data:
bankname <- c("Bank A","Bank B","Bank C","Bank D","Bank E")
bankid <- c( 1, 2, 3, 4, 5)
year<- c( 1881, 1881, 1881, 1881, 1881)
deposit <- c(244789, 195755, 107736, 170600, 32000000)
dep_cert<-c(7250,10243,13357,NA,351266)
capital<-c(20218,185151,177612,20000,314012)
surplus<-c(29513,NA,NA,NA,NA)
bankdata<-data.frame(bankname, bankid,year,deposit, dep_cert, capital, surplus)
I want to create a new column named liability as the sum of deposit, dep_cert, capital and surplus. That means the data would look like this:
bankname bankid year deposit dep_cert capital surplus liability
Bank A 1 1881 244789 7250 20218 29513 301770
Bank B 2 1881 195755 10243 185151 NA 391149
Bank C 3 1881 107736 13357 177612 NA 298705
Bank D 4 1881 170600 NA 20000 NA 190600
Bank E 5 1881 32000000 351266 314012 NA 32665278
However, when I use the sum command in R, I get NAs because of the missing values. In Stata, I would do
egen liability = rowtotal(deposit, dep_cert,capital, surplus)
What is the equivalent code in R?
Also, my second question: to replace all missing values (NAs) with the number 0 in Stata, I would do
foreach x of varlist deposit dep_cert capital surplus {
replace `x'=0 if missing(`x')
}
What is the equivalent code in R?
Answer 0 (score: 5)
In this case, the equivalent is rowSums:
rowSums(bankdata[c("deposit", "dep_cert", "capital", "surplus")], na.rm = TRUE)
# [1] 301770 391149 298705 190600 32665278
The main thing you were missing is the na.rm = TRUE argument.
To add it to the data.frame, you can do the following:
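As a minimal base-R sketch of why na.rm = TRUE matters, here is a small made-up data frame (the names `toy`, `a` and `b` are illustrative and not from the question):

```r
# Toy data with missing values; names here are illustrative only
toy <- data.frame(a = c(1, 2, NA), b = c(10, NA, NA))

# Without na.rm, any NA in a row makes that row's total NA
without_na_rm <- rowSums(toy)

# With na.rm = TRUE, NAs are skipped; a row that is all NA sums to 0
with_na_rm <- rowSums(toy, na.rm = TRUE)

print(without_na_rm)
print(with_na_rm)
```

Note the last row: with na.rm = TRUE, a row whose values are all missing gets a total of 0 rather than NA, which may or may not be what you want.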
bankdata$liability <- rowSums(bankdata[c("deposit", "dep_cert",
"capital", "surplus")],
na.rm = TRUE)
To replace the NA values with 0 in those same columns, you can do the following:
## save some typing
cols <- c("deposit", "dep_cert", "capital", "surplus")
bankdata[cols][is.na(bankdata[cols])] <- 0
bankdata
# bankname bankid year deposit dep_cert capital surplus
# 1 Bank A 1 1881 244789 7250 20218 29513
# 2 Bank B 2 1881 195755 10243 185151 0
# 3 Bank C 3 1881 107736 13357 177612 0
# 4 Bank D 4 1881 170600 0 20000 0
# 5 Bank E 5 1881 32000000 351266 314012 0
Answer 1 (score: 2)
Using data.table:
library(data.table)
nm1 <- c("deposit", "dep_cert", "capital", "surplus")
setDT(bankdata)[,liability:=Reduce(`+`,
lapply(.SD, function(x) replace(x, is.na(x), 0))),.SDcols=nm1]
bankdata
# bankname bankid year deposit dep_cert capital surplus liability
#1: Bank A 1 1881 244789 7250 20218 29513 301770
#2: Bank B 2 1881 195755 10243 185151 NA 391149
#3: Bank C 3 1881 107736 13357 177612 NA 298705
#4: Bank D 4 1881 170600 NA 20000 NA 190600
#5: Bank E 5 1881 32000000 351266 314012 NA 32665278
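To see what the Reduce call is doing, here is a base-R sketch without data.table, using a small illustrative list of columns (`cols_list` and `zero_na` are made-up names, not part of the answer):

```r
# Illustrative columns; not the question's data
cols_list <- list(x = c(1, NA, 3), y = c(10, 20, NA), z = c(100, 200, 300))

# Helper (hypothetical name): swap NA for 0 in one vector
zero_na <- function(v) replace(v, is.na(v), 0)

# Reduce(`+`, ...) adds the vectors element-wise, one after another:
# ((x + y) + z), after each has had its NAs zeroed
total <- Reduce(`+`, lapply(cols_list, zero_na))
print(total)
```

Because the NAs are zeroed only inside the lapply call, the original vectors are left unchanged.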
Replace the NA values with 0 and then compute the row sums:
setDT(bankdata)[, (nm1):=lapply(.SD, function(x)
replace(x, is.na(x), 0)), .SDcols=nm1][,
liability:=Reduce(`+`, .SD), .SDcols=nm1]
bankdata
# bankname bankid year deposit dep_cert capital surplus liability
#1: Bank A 1 1881 244789 7250 20218 29513 301770
#2: Bank B 2 1881 195755 10243 185151 0 391149
#3: Bank C 3 1881 107736 13357 177612 0 298705
#4: Bank D 4 1881 170600 0 20000 0 190600
#5: Bank E 5 1881 32000000 351266 314012 0 32665278
bankdata1 <- bankdata[rep(1:nrow(bankdata), 1e5),]
row.names(bankdata1) <- 1:nrow(bankdata1)
f1 <- function() {rowSums(bankdata1[c("deposit", "dep_cert",
"capital", "surplus")],
na.rm = TRUE)
}
f2 <- function() {nm1 <- c("deposit", "dep_cert", "capital", "surplus")
DT <- data.table(bankdata1, key=c('bankname', 'bankid', 'year'))
DT[, liability:=Reduce(`+`,
lapply(.SD, function(x) replace(x, is.na(x), 0))),.SDcols=nm1]
}
library(microbenchmark)
microbenchmark(f1(), f2(), unit="relative")
# Unit: relative
#expr min lq median uq max neval
#f1() 1.558999 1.355819 1.457036 1.426796 1.525313 100
#f2() 1.000000 1.000000 1.000000 1.000000 1.000000 100
Answer 2 (score: 1)
Not a complete answer, but too long for a comment:
The Stata code as you originally stated it
foreach `x' of varlist deposit dep_cert capital surplus {
replace `x'=0 if missing(`x')
}
(1) would not work and (2) is a bad idea anyway.
This would work:
foreach x of varlist deposit dep_cert capital surplus {
replace `x' = 0 if missing(`x')
}
This would work too, and is more concise:
foreach x in deposit dep_cert capital surplus {
replace `x' = 0 if missing(`x')
}
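However, filling in missings with zeros in the original data can lose information and seriously compromise the integrity of the data. By default, egen ignores missings when calculating row totals, so the replacement is not needed in this context anyway.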
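The same concern carries over to R. As a hedged sketch, one way to get the row total while leaving the NAs in place is shown below, together with a count of how many components were missing in each row. The column name `n_missing` is an illustrative addition, not part of the question, and only three of the banks are rebuilt here to keep the example short:

```r
# Subset of the question's data, rebuilt so the example is self-contained
bankdata <- data.frame(
  bankname = c("Bank A", "Bank B", "Bank D"),
  deposit  = c(244789, 195755, 170600),
  dep_cert = c(7250, 10243, NA),
  capital  = c(20218, 185151, 20000),
  surplus  = c(29513, NA, NA)
)
cols <- c("deposit", "dep_cert", "capital", "surplus")

# Row total that skips NAs; the original columns are not modified
bankdata$liability <- rowSums(bankdata[cols], na.rm = TRUE)

# Count of missing components per row (illustrative extra column)
bankdata$n_missing <- rowSums(is.na(bankdata[cols]))

print(bankdata)
```

This keeps the distinction between a true zero and an unreported value, which is lost once the NAs are overwritten.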
Answer 3 (score: 0)
For both tasks, you can also use mutate from the dplyr package.
library(dplyr)
vars <- c("deposit", "dep_cert", "capital", "surplus")
As A Handcart And Mohair explained in their answer, you can use rowSums together with na.rm = TRUE:
bankdata = bankdata %>%
mutate(liability = rowSums(.[vars], na.rm = TRUE))
I would also advise against doing this (see Nick Cox's comment), but if you need to, you can use mutate_ together with replace (a related post on SO covers this as well).
var_fun <- paste("replace(", vars, ", is.na(", vars, "), 0)", sep="")
bankdata = bankdata %>%
mutate_(.dots = setNames(var_fun, eval(vars)))
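setNames creates a named vector that pairs each variable name with the expression that generates the variable. You need the underscore variant mutate_ here in order to use quoted variable names. The technique is explained in more detail in this answer (non-standard evaluation).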
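mutate_ has since been deprecated in dplyr. Here is a sketch of the same replacement for newer versions, assuming dplyr 1.0.0 or later is installed (where across() and all_of() are available). The name `bankdata_filled` is illustrative, and only two banks are rebuilt to keep it short:

```r
library(dplyr)

# Small self-contained subset of the question's data
bankdata <- data.frame(
  bankname = c("Bank A", "Bank D"),
  deposit  = c(244789, 170600),
  dep_cert = c(7250, NA),
  capital  = c(20218, 20000),
  surplus  = c(29513, NA)
)
vars <- c("deposit", "dep_cert", "capital", "surplus")

# Replace NA with 0 in the selected columns only
bankdata_filled <- bankdata %>%
  mutate(across(all_of(vars), ~ replace(.x, is.na(.x), 0)))

print(bankdata_filled)
```

Writing the result to a new object rather than back to bankdata keeps the original NAs available, in line with the caution above.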