I have a large data frame in R in which every row looks like this:
name amount date1 date2 days_out year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971
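If someone's days_out is under 60, they get a 90% discount; for 60-90, a 70% discount, and so on. I need to find the total discounted amount for each year. My thoroughly embarrassing workaround was to write a Python script that writes an R script, which looks like this for each relevant year: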
tmp <- members[members$year==1971, ]
tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ]
tmp70 <- tmp[tmp$days_out <= 90 & tmp$days_out > 60 & !is.na(tmp$days_out), ]
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out), ]
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out), ]
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ]
details.1971 <- c(1971, nrow(tmp),
nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9,
nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7,
nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5,
nrow(tmp30), sum(tmp30$amount), sum(tmp90$amount) * .9,
nrow(tmp00), sum(tmp00$amount))
membership.for.chart <- rbind(membership.for.chart,details.1971)
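It works fine. The tmp frames and vectors get overwritten, which is fine. But I know I have completely defeated everything that is elegant and efficient about R here. I first launched R a month ago, and I think I have come a long way, but I would really like to know how I should have gone about this.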
Answer 0 (score: 2)
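You can use the cut function or the findInterval function. The exact code will depend on the internal structure of the object, which console output does not communicate unambiguously. If days_out is a difftime object, then something like this might work: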
# Untested sketch: look up each row's multiplier from its days_out band
disc_amt <- with(tmp, amount * c(.9, .7, .5, .9, 1)[
    findInterval(days_out, c(0, 60, 90, 120, 180, Inf))])
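You should post the output of dput() on the tmp object, or, if it is really big, perhaps dput(head(tmp, 20)), so that testing can proceed. (The actual discounts do not seem to be ordered the way I expected.)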
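The two functions treat band edges differently, which matters for values that fall exactly on 60, 90 and so on. The following is a minimal sketch on made-up days_out values (not the asker's data): with default arguments, cut uses right-closed intervals, matching the `> 60 & <= 90` style conditions in the question, while findInterval uses left-closed intervals unless left.open = TRUE is set.

```r
# Hypothetical values chosen to sit on and around the band edges
days   <- as.difftime(c(30, 60, 61, 90, 200, NA), units = "days")
breaks <- c(0, 60, 90, 120, 180, Inf)
x      <- as.numeric(days)   # drop the difftime class, keep the day counts

# cut(): bands are (0,60], (60,90], ... by default (right = TRUE)
band_cut <- cut(x, breaks)

# findInterval(): bands are [0,60), [60,90), ... by default
band_fi <- findInterval(x, breaks)

# findInterval() with left.open = TRUE gives right-closed bands like cut()
band_fi_open <- findInterval(x, breaks, left.open = TRUE)

print(data.frame(days = x, band_cut, band_fi, band_fi_open))
```

In this sketch a value of exactly 60 falls in the first band under cut but in the second band under the default findInterval, so it is worth picking whichever convention matches the discount rules.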
Answer 1 (score: 2)
Hopefully this will get you started:
#Import your data; add dummy column to separate 'days' suffix into its own column
dat <- read.table(text = " name amount date1 date2 days_out dummy year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971",header = TRUE,sep = "")
#Repeat 3 times
df <- rbind(dat,dat,dat)
#Create new year variable
df$year <- rep(1971:1973,each = nrow(dat))
#Breaks for discount levels
ct <- c(0,60,90,120,180,Inf)
#Cut into a factor
df$fac <- cut(df$days_out,ct)
#Create discount amounts for each row
df$discount <- c(0.9,0.7,0.5,0.9,1)[df$fac]
df$discount[is.na(df$discount)] <- 1
#Calc adj amount
df$amount_adj <- with(df,amount * discount)
#I use plyr a lot, but there are many, many
# alternatives
library(plyr)
ddply(df,.(year),summarise,
amt = sum(amount_adj),
total = length(year),
d60 = length(which(fac == "(0,60]")))
I only calculated a few of the summary values in the final ddply command. I assume you can extend it yourself.
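As one possible way to extend this without plyr, here is a base R sketch that produces per-year totals and per-band counts. The data frame is made up for illustration, the column names band, rate, adjusted and n are hypothetical, and the multipliers are copied from the vector used in this thread (including the 0.9 in the fourth slot), so adjust them to the real discount rules.

```r
# Made-up data for illustration only (not the asker's real data)
members <- data.frame(
  year     = c(1971, 1971, 1971, 1972, 1972),
  amount   = c(100, 200, 300, 400, 500),
  days_out = c(30, 75, NA, 100, 400)
)

breaks <- c(0, 60, 90, 120, 180, Inf)
rates  <- c(0.9, 0.7, 0.5, 0.9, 1)  # copied from the answers; adjust as needed

# Assign each row to a band; rows with NA days_out keep the full amount
members$band <- cut(members$days_out, breaks)
members$rate <- rates[members$band]
members$rate[is.na(members$rate)] <- 1
members$adjusted <- members$amount * members$rate
members$n <- 1  # helper column so that sum() counts rows

# One row per year: member count, raw total, adjusted total
by_year <- aggregate(members[c("n", "amount", "adjusted")],
                     by = list(year = members$year), FUN = sum)

# Member counts per year and band (rows with NA band are not counted here)
band_counts <- table(members$year, members$band)

print(by_year)
print(band_counts)
```

The per-band sums from the original script could be obtained the same way by adding the band to the grouping list in aggregate.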