How do I perform complex calculations on the columns and rows of a data.table?

Time: 2016-08-12 17:04:25

Tags: r data.table

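I am learning the syntax for manipulating data.table objects. I can do simple things, but my understanding is not yet solid enough for more complex tasks. For example, I want to transform the following data so that each row holds one distinct value of "type", separate columns are generated based on the value of "subtype", and the unique values are collapsed into one string whenever several rows share the same "type/subtype" combination.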

Given the input data:

data = data.frame(
    var1 = c("a","b","c","b","d","e","f"),
    var2 = c("aa","bb","cc","dd","ee","ee","ff"),
    subtype = c("1","2","2","2","1","1","2"),
    type = c("A","A","A","A","B","B","B")
    )

  var1 var2 subtype type
1    a   aa       1    A
2    b   bb       2    A
3    c   cc       2    A
4    b   dd       2    A
5    d   ee       1    B
6    e   ee       1    B
7    f   ff       2    B

I would like to derive:

  1.var1 1.var2 2.var1 2.var2     2.type
A "a"    "aa"   "b|c"  "bb|cc|dd" "A"   
B "d|e"  "ee"   "f"    "ff"       "B"   

With a data frame, I can achieve this using the following code:

data.derived = do.call(
    rbind,
    lapply(
        split(data,list(data$type)),
        function(x) {
            do.call (
                c,
                lapply(
                    split(x, list(x$subtype)),
                    function(y) {
                        result = c(
                            var1 = paste(unique(y$var1),collapse ="|"),
                            var2 = paste(unique(y$var2),collapse ="|")
                        )
                        if (as.character(y$subtype[1]) == "2") {
                            result = c(result, type = as.character(y$type[1]))
                        }
                        result}))}))

How can I do the same thing with a data.table?

2 Answers:

Answer 0: (score: 5)

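It is clear from your desired result that you are reshaping the data from long to wide format, with the subtypes spread out across the columns, so you need dcast from data.table. Since you also want the values in var1 and var2 collapsed into a single string per cell, you need to supply a custom aggregation function based on paste: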

library(data.table)
setDT(data)
dcast(data, type ~ subtype, value.var = c("var1", "var2"), 
            fun.aggregate = function(v) paste0(unique(v), collapse = "|"))

#    type var1_function_1 var1_function_2 var2_function_1 var2_function_2
# 1:    A               a             b|c              aa        bb|cc|dd
# 2:    B             d|e               f              ee              ff
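
The same reshaping can also be split into two explicit steps, which may be easier to follow: first collapse the values per type/subtype pair, then cast to wide format. The following is only a sketch under a few assumptions: the object name `dt`, the helper `collapse_unique`, and the final renaming step are illustrative and not part of the answer above, and the intermediate column names assume dcast's default `sep = "_"`.

```r
library(data.table)

# Same input as in the question, built directly as a data.table
dt <- data.table(
  var1    = c("a", "b", "c", "b", "d", "e", "f"),
  var2    = c("aa", "bb", "cc", "dd", "ee", "ee", "ff"),
  subtype = c("1", "2", "2", "2", "1", "1", "2"),
  type    = c("A", "A", "A", "A", "B", "B", "B")
)

# Hypothetical helper: join the distinct values of a vector with "|"
collapse_unique <- function(v) paste(unique(v), collapse = "|")

# Step 1: one row per type/subtype pair, values collapsed
long <- dt[, lapply(.SD, collapse_unique),
           by = .(type, subtype), .SDcols = c("var1", "var2")]

# Step 2: long to wide; each cell now holds exactly one value,
# so no aggregation function is needed
wide <- dcast(long, type ~ subtype, value.var = c("var1", "var2"))

# Optional: rename to the "<subtype>.<variable>" style used in the question
setnames(wide,
         old = c("var1_1", "var2_1", "var1_2", "var2_2"),
         new = c("1.var1", "1.var2", "2.var1", "2.var2"))
print(wide)
```

Unlike the desired output in the question, this keeps type as an ordinary column rather than as row names, and it does not add a separate "2.type" column.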

Answer 1: (score: 1)

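I'm not sure whether you want to use the data.table package and its commands, or whether you want to know if your existing code also works on data tables.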

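I think complex calculations call for appropriate packages. The script above works for you, but anyone who did not write it would have a hard time seeing what it does.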

Before you start with data.table, have a look at some nice packages that make your life easier, such as:

library(dplyr)
library(tidyr)

data = data.frame(
  var1 = c("a","b","c","b","d","e","f"),
  var2 = c("aa","bb","cc","dd","ee","ee","ff"),
  subtype = c("1","2","2","2","1","1","2"),
  type = c("A","A","A","A","B","B","B")
)

data %>% 
  group_by(type, subtype) %>%
  summarise(x1 = paste(unique(var1),collapse ="|"),
            x2 = paste(unique(var2),collapse ="|")) %>%
  unite(xx,x1,x2) %>%
  spread(subtype,xx) %>%
  separate(`1`, c("1.var1","1.var2"), sep="_") %>%
  separate(`2`, c("2.var1","2.var2"), sep="_") %>%
  ungroup

# # A tibble: 2 x 5
#      type 1.var1 1.var2 2.var1   2.var2
#  * <fctr>  <chr>  <chr>  <chr>    <chr>
# 1      A      a     aa    b|c bb|cc|dd
# 2      B    d|e     ee      f       ff
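
With more recent releases of these packages, the unite/spread/separate round trip can probably be avoided. The following is a sketch, assuming dplyr >= 1.0.0 (for `across` and `.groups`) and tidyr >= 1.1.0 (for `pivot_wider` with `names_glue`); the object names `input` and `wide` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Same input as in the question
input <- data.frame(
  var1    = c("a", "b", "c", "b", "d", "e", "f"),
  var2    = c("aa", "bb", "cc", "dd", "ee", "ee", "ff"),
  subtype = c("1", "2", "2", "2", "1", "1", "2"),
  type    = c("A", "A", "A", "A", "B", "B", "B")
)

wide <- input %>%
  group_by(type, subtype) %>%
  # collapse the distinct values of each variable within a type/subtype pair
  summarise(across(c(var1, var2), ~ paste(unique(.x), collapse = "|")),
            .groups = "drop") %>%
  # one column per subtype/variable pair, named like "1.var1"
  pivot_wider(names_from = subtype,
              values_from = c(var1, var2),
              names_glue = "{subtype}.{.value}")

print(wide)
```

Here `names_glue` builds the "1.var1"-style column names directly, so no separator has to be chosen and split on again afterwards.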

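You can use the same code, or even your own script, when you have a data.table instead of a data frame. But if you are looking for data.table-specific commands, that is a different story.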
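
As a small illustration of that point: a data.table is also a data.frame, so code written for data frames generally accepts it, while the data.table-specific syntax looks different. This is a minimal sketch on made-up toy data; the names `dt`, `collapsed_base`, and `collapsed_dt` are assumptions for the example.

```r
library(data.table)

# Toy data, not the data from the question
dt <- data.table(var1 = c("a", "b", "c"), type = c("A", "A", "B"))

# A data.table inherits from data.frame
is_df <- is.data.frame(dt)

# Base R code written with data frames in mind works unchanged
collapsed_base <- tapply(dt$var1, dt$type,
                         function(v) paste(unique(v), collapse = "|"))

# The data.table way of expressing the same aggregation
collapsed_dt <- dt[, .(var1 = paste(unique(var1), collapse = "|")), by = type]

print(is_df)
print(collapsed_base)
print(collapsed_dt)
```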