Is there an efficient way to run a function for each group of a factor in a data.table?

Asked: 2019-04-18 12:59:27

Tags: r data.table

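I am trying to solve this problem as efficiently as possible, and I do not know whether what I have so far is the best option. Do you have any alternatives?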

library(data.table)

tr <- data.table(industry=as.factor(c("a","a","a","b","b","b")), country=c("ch","gb", "us", "gb", "us", "us"), rat1=c(11,41,3,2,5,7), rat2=c(5,4,2,77,2,3))

SummaryStat <- function(tab, rat, var, val){
  # subset the table only when both a grouping variable and a value are given
  if (!missing(var) && !missing(val)) {
    tab <- tab[eval(var)==val]
  }
  else{
    var = "NA"
    val = "NA"
  }
  # keep only the ratio column (as a plain vector)
  tab <- tab[, get(rat)]
  # summarise the ratio: count, minimum and maximum
  summary.result <- data.frame(N=length(tab),
                               min=min(tab),
                               max=max(tab),
                               row.names=rat)
  # return the summary
  return(summary.result)
}


# loop over all ratio columns
for (rat in grep("rat", names(tr), value = TRUE)) {
  # loop over all industries
  for (ind in levels(tr[, industry])) {
    # append the summary of the ratio for this industry to a .csv file
    write.table(SummaryStat(tr, rat = rat, var = quote(industry), val = ind),
                file = "test.csv", sep = ";", col.names = NA, append = TRUE)
  }
  # loop over all countries; country is a character column, so levels()
  # would return NULL here and unique() is used instead
  for (cou in unique(tr[, country])) {
    # append the summary of the ratio for this country to a .csv file
    write.table(SummaryStat(tr, rat = rat, var = quote(country), val = cou),
                file = "test.csv", sep = ";", col.names = NA, append = TRUE)
  }
}

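The output I get is exactly what I want (actually, it would be nice if the column names were not repeated for every function call), but I would like to know whether there is a better way to do this, specifically in the part where I use the for loops.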

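(Treat the function as an example only: my real one takes the same parameters but is more complex, and I would like to avoid making any changes to it.)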

1 Answer:

Answer 0: (score: 1)

I would try to do this in one go and then save the output. I believe this fits your needs; if not, please let me know :)

# try converting to long format, and then using the by conditions to get 
# aggregate views
# melt is used to convert wide to long, splitting columns over combinations 
# of the id.vars
tr2 <- melt(tr, id.vars = c("industry", "country"))
# do the aggregations, at (1) industry level, (2) at country level
sol1 <- tr2[, .(N=.N, min=min(value), max=max(value)), by=.(variable, industry)]
sol2 <- tr2[, .(N=.N, min=min(value), max=max(value)), by=.(variable, country)]
# sense check
sol1[]
sol2[]
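The answer mentions saving the output but does not show that step. A minimal sketch of one way to do it follows, assuming data.table's fwrite() is acceptable in place of write.table(); the file names and the use of tempdir() are hypothetical choices made only for this illustration. Because each table is written in a single call, its header appears only once.

```r
library(data.table)

# same example data as in the question
tr <- data.table(industry = as.factor(c("a", "a", "a", "b", "b", "b")),
                 country  = c("ch", "gb", "us", "gb", "us", "us"),
                 rat1     = c(11, 41, 3, 2, 5, 7),
                 rat2     = c(5, 4, 2, 77, 2, 3))

# long format, then one grouped aggregation per grouping column
tr2  <- melt(tr, id.vars = c("industry", "country"))
sol1 <- tr2[, .(N = .N, min = min(value), max = max(value)), by = .(variable, industry)]
sol2 <- tr2[, .(N = .N, min = min(value), max = max(value)), by = .(variable, country)]

# hypothetical output location and file names
out.dir       <- tempdir()
industry.file <- file.path(out.dir, "summary_by_industry.csv")
country.file  <- file.path(out.dir, "summary_by_country.csv")

# one call per table, so the column names are written once per file
fwrite(sol1, industry.file, sep = ";")
fwrite(sol2, country.file,  sep = ";")
```

Writing to two files keeps the industry and country summaries apart, since their grouping columns differ.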

Edit: Sorry, I forgot the N column. .N is data.table's built-in symbol for the number of rows in the current group.
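To illustrate how .N behaves, here is a small sketch on a toy table; the table dt and its columns g and v are made up for this illustration and are not part of the question's data.

```r
library(data.table)

# hypothetical toy data: two rows in group "x", one row in group "y"
dt <- data.table(g = c("x", "x", "y"), v = c(1, 2, 3))

# without by, .N is the total number of rows
n.total <- dt[, .N]

# with by, .N is the number of rows in each group
n.by.group <- dt[, .N, by = g]

# .N gives the same count as length() of a column within each group
check <- dt[, .(N = .N, len = length(v)), by = g]
```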

Edit: in response to the comments...

SummaryStat <- function(table, ids){ 
  table <- melt(table, id.vars = ids)

  output <- lapply(ids, function(index){
    table[, .(N=.N, min=min(value), max=max(value)), by=c("variable", index)] 
  })
  names(output) <- ids
  return(output)
} 

SummaryStat(tr, c("industry", "country"))
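The function above returns a named list with one summary table per id column. As a rough sketch of how that list might be used, the code below stacks the tables into one; the column names level and grouping are assumptions introduced only for this illustration, and the grouping values are converted to character so that the factor industry and the character country can share a column.

```r
library(data.table)

tr <- data.table(industry = as.factor(c("a", "a", "a", "b", "b", "b")),
                 country  = c("ch", "gb", "us", "gb", "us", "us"),
                 rat1     = c(11, 41, 3, 2, 5, 7),
                 rat2     = c(5, 4, 2, 77, 2, 3))

# the function from the answer, repeated so the sketch is self-contained
SummaryStat <- function(table, ids){
  table <- melt(table, id.vars = ids)
  output <- lapply(ids, function(index){
    table[, .(N=.N, min=min(value), max=max(value)), by=c("variable", index)]
  })
  names(output) <- ids
  return(output)
}

res <- SummaryStat(tr, c("industry", "country"))

# each element can be read by name, e.g. res$industry or res$country
stacked <- rbindlist(lapply(names(res), function(id) {
  out <- copy(res[[id]])
  setnames(out, id, "level")            # common name for the grouping column
  out[, level := as.character(level)]   # factor and character share one type
  out[, grouping := id]                 # remember which id column was used
  out
}), use.names = TRUE)
```

The stacked table can then be written with a single call, which is one way to avoid repeating the header.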