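I am gradually moving from SAS to R, and right now I am trying to reproduce what I used to do with macros.
I have one table containing all my data (let's call it IDF_pop), and from it I create two other tables, YVE_pop and EPCI_pop, which are subsets of the main table. I prefer creating separate tables, though I suspect that may not be the best choice. Here is how I do it: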
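## Packages used below: reshape2 (melt), dplyr (%>%, filter, group_by, summarise), stringi (stri_sub)
library(reshape2)
library(dplyr)
library(stringi)
## Let's say the main table contains 10 lines.
## codgeo is the city's postal code, epci is the area, and I have three
## variables that describe different parts of the population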
codgeo <- c("75014","75020","78300","78520","78650","91200","91600","92500","93100","95230")
epci <- c("001","001","002","002","003","004","004","005","006","007")
pop0_15 <- c(10000*runif(10))
pop15_64 <- c(10000*runif(10))
pop65p <- c(10000*runif(10))
IDF_pop <- data.frame(codgeo,epci,pop0_15,pop15_64,pop65p)
## I'd like my population to be in one single column, for this I'll use melt
IDF_pop_line <- melt(IDF_pop,c("codgeo","epci"))
## Now I want to create separate tables for the Yvelines department (codgeo starts with 78) and for EPCI 002
## I could do it in two lines but I wanted to train using functions so here goes
localisation <- function(code_dep, lib_dep, code_epci, lib_epci){
  # Department subset: codgeo starts with code_dep
  do.call("<<-",
          list(paste0(eval(lib_dep),"_pop_line"),
               IDF_pop_line %>% filter(stri_sub(codgeo,from=1,length=2)==code_dep)
          )
  )
  # EPCI subset
  do.call("<<-",
          list(paste0(eval(lib_epci),"_pop_line"),
               IDF_pop_line %>% filter(epci==code_epci)
          )
  )
}
do.call("localisation",list("78","YVE","002","GPSO"))
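With this I have my 3 tables (IDF_, YVE_, GPSO_), and I can get to the main problem.
The next step is to summarise my tables. I am trying to write one function that works on all 3 of them.
I would like it to depend entirely on its argument, but it seems do.call will not accept paste0 in its second argument.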
## Aggregating the tables. I'll call the function 3 times, once for each level.
agregation <- function(lib){
  # This version doesn't work:
  do.call("<<-",
          list(paste0(eval(lib),"_pop_agr"),
               paste0(eval(lib),"_pop_line") %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
do.call("agregation",list("IDF")) # This one doesn't work
agregation2 <- function(lib){
  do.call("<<-",
          list(paste0(eval(lib),"_pop_agr"),
               IDF_pop_line %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
do.call("agregation2",list("IDF")) # This one does
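As you can see, the only way I have found so far that works is to write out the full name of the table being aggregated. But that defeats the original idea of having something I can parameterise freely. How can I modify the first version of the function so that it works for all three possible arguments?
Finally, I know a simple workaround would be to keep my IDF_pop_line table and filter at the last moment to create the 3 aggregated tables, but I would rather work with separate tables from the start.
Thanks in advance for your help!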
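Answer 0 (score: 0)
In the agregation function, the expression paste0(eval(lib),"_pop_line") returns the name of the data frame as a character string, not the data frame itself, so the string is what gets piped into group_by.
Try get, which looks up the object with that name: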
agregation <- function(lib){
  do.call("<<-",
          list(paste0(eval(lib),"_pop_agr"),
               get(paste0(eval(lib),"_pop_line")) %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
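To make the difference between the name and the object concrete, here is a minimal base-R sketch. The toy values in IDF_pop_line and the helper name aggregate_table are made up for illustration, and aggregate()/assign() stand in for the dplyr pipeline and the do.call("<<-", ...) call above. It is one possible way to write it under those assumptions, not the only one.

```r
# Toy stand-in for IDF_pop_line (hypothetical values, for illustration only)
IDF_pop_line <- data.frame(
  variable = c("pop0_15", "pop0_15", "pop65p"),
  value    = c(10, 20, 5),
  stringsAsFactors = FALSE
)

lib <- "IDF"
table_name <- paste0(lib, "_pop_line")

is.character(table_name)         # TRUE: paste0() only builds a string
is.data.frame(get(table_name))   # TRUE: get() fetches the object with that name

# Hypothetical helper: get() reads the source table by name,
# assign() writes the result under a name that is also built from a string.
aggregate_table <- function(lib, envir = parent.frame()) {
  src <- get(paste0(lib, "_pop_line"), envir = envir)
  agg <- aggregate(value ~ variable, data = src, FUN = sum)
  names(agg)[names(agg) == "value"] <- "pop"
  assign(paste0(lib, "_pop_agr"), agg, envir = envir)
  invisible(agg)
}

aggregate_table("IDF")
IDF_pop_agr
```

Calling aggregate_table("YVE") or aggregate_table("GPSO") would work the same way, provided tables with the matching _pop_line names exist in the calling environment.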
Answer 1 (score: 0)
Here is a suggestion using data.table.
You can apply it directly to the IDF_pop you created, before any of your functions come into play.
library(data.table)
#convert IDF_pop to a data.table (by reference)
setDT( IDF_pop )
#create groups to summarise by
IDF_pop[ epci == "002", GSPO := TRUE][]
IDF_pop[ grepl("^78", codgeo) , YVE := TRUE][]
#melt and keep only the rows where a group flag is TRUE
dt <- data.table::melt( IDF_pop,
                        id.vars = c("codgeo", "epci", "pop0_15", "pop15_64", "pop65p"),
                        measure.vars = c("GSPO", "YVE"))[ value == TRUE,][]
Intermediate result (dt)
# codgeo epci pop0_15 pop15_64 pop65p variable value
# 1: 78300 002 6692.394 5441.225 4008.875 GSPO TRUE
# 2: 78520 002 2128.604 6808.004 1889.822 GSPO TRUE
# 3: 78300 002 6692.394 5441.225 4008.875 YVE TRUE
# 4: 78520 002 2128.604 6808.004 1889.822 YVE TRUE
# 5: 78650 003 8482.971 6556.482 5098.929 YVE TRUE
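Code
#now summarising is easy: sum all pop columns by variable group
dt[, lapply( .SD, sum), by = variable, .SDcols = names(dt)[grepl("^pop", names(dt) )] ]
Final output (the sums do not match the rows shown above, presumably because the random data were regenerated between the two runs)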
# variable pop0_15 pop15_64 pop65p
# 1: GSPO 7171.683 5855.894 11866.55
# 2: YVE 12602.153 8028.948 14364.21
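The same "keep one table, summarise by group" idea can also be sketched in base R, without data.table. In this sketch the set.seed() call, the groups list and the pop_agr name are assumptions added for illustration (the seed only makes reruns agree), and the label GPSO follows the question's spelling. Treat it as a rough outline rather than a drop-in replacement for the code above.

```r
set.seed(1)  # assumption: fixed seed only so that reruns give the same numbers

IDF_pop <- data.frame(
  codgeo   = c("75014","75020","78300","78520","78650",
               "91200","91600","92500","93100","95230"),
  epci     = c("001","001","002","002","003","004","004","005","006","007"),
  pop0_15  = 10000 * runif(10),
  pop15_64 = 10000 * runif(10),
  pop65p   = 10000 * runif(10),
  stringsAsFactors = FALSE
)

# One logical filter per level, stored in a named list instead of separate tables
groups <- list(
  IDF  = rep(TRUE, nrow(IDF_pop)),
  YVE  = startsWith(IDF_pop$codgeo, "78"),
  GPSO = IDF_pop$epci == "002"
)

pop_cols <- grep("^pop", names(IDF_pop), value = TRUE)

# One row per level, one column per population variable
pop_agr <- t(sapply(groups, function(keep) colSums(IDF_pop[keep, pop_cols])))
pop_agr
```

Adding another level then only means adding one more entry to groups, so nothing has to be created by name in the global environment.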