Finding greater or smaller values by grouping variable using data.table

Time: 2017-10-27 05:02:28

Tags: r dplyr data.table

My source data covers several months, but I only want to compare the data for two months that are specified in advance.

Here is my input data:

dput(mydf)
structure(list(Month = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 
2L, 1L, 2L, 1L), .Label = c("Aug", "Oct", "Sep"), class = "factor"), 
    Pipe = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5), Gp = structure(c(1L, 
    1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", 
    "B", "C", "D", "E", "F"), class = "factor")), .Names = c("Month", 
"Pipe", "Gp"), row.names = c(NA, -11L), class = "data.frame")

Out of these three months, I only want to compare the two months specified by the following variables.

This_month_to_compare <- "Oct"
Last_Month_to_compare <- "Aug"

Now, for the two given months and within each group Gp, I want to indicate whether the Pipe value for This_month_to_compare is greater than the Pipe value for Last_Month_to_compare. If one of the two values is missing, we leave the result blank.

This is what the output should look like (created manually, since I did not succeed with code):

structure(list(Month = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 
2L, 1L, 2L, 1L), .Label = c("Aug", "Oct", "Sep"), class = "factor"), 
    Pipe = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5), Gp = structure(c(1L, 
    1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", 
    "B", "C", "D", "E", "F"), class = "factor"), Greater = c(NA, 
    TRUE, NA, FALSE, NA, NA, NA, FALSE, NA, NA, NA)), .Names = c("Month", 
"Pipe", "Gp", "Greater"), row.names = c(NA, -11L), class = "data.frame")

Month Pipe Gp Greater Explanation
Aug   3    A          Ignore: Aug
Oct   4    A  TRUE    4 > 3
Aug   5    B          Ignore: Aug
Oct   3    B  FALSE   3 < 5
Sep   2    B          Ignore: Sep
Aug   1    C          Ignore: Aug
Oct   3    D          There is nothing to compare with
Oct   3    E  FALSE   3 < 4
Aug   4    E          Ignore: Aug
Oct        F          Cannot compare NA with 5
Aug   5    F          Ignore: Aug

I added the explanations above manually.

I did try to code this; here is my attempt:

# The input data frame shown above is assumed to be named mydf
mydfi <- data.table::as.data.table(mydf)
mydf <- mydfi
# Method 1: convert to wide format
mydf <- data.table::dcast(mydf, Gp ~ Month, value.var = "Pipe")
# Compare the two months
mydf$Growth <- mydf[[This_month_to_compare]] > mydf[[Last_Month_to_compare]]
# Back to long format
Melt_columns <- c("Aug", "Oct", "Sep")
mydf <- data.table::melt(mydf, measure.vars = Melt_columns, variable.name = "Month", value.name = "Pipe")
# Left join back onto the original rows
mydfo <- mydf[mydfi, on = c("Month", "Gp", "Pipe")]
# Keep the result only on rows of the month being compared
mydfo[Month != This_month_to_compare, Growth := NA]

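Update: I was able to solve the problem above simply by adding a left join, and I have updated the code above accordingly. However, I am looking for a solution along the lines of this question: Calculate difference between values in consecutive rows by group

The reason is that my actual dataset is large, which makes joins impractical.

Any help is much appreciated. Thanks in advance.

2 Answers:

Answer 0: (score: 1)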

Is this what you had in mind?

> library(data.table)
> mydf <- data.table(mydf)
> This_month_to_compare <- "Oct"
> Last_Month_to_compare <- "Aug"
> setkey(mydf, Gp, Month)
> 
> # Make dummy table to join with
> mydf[
+   , Pipe_this := .SD[Month == This_month_to_compare, Pipe], by = "Gp"][
+   , Pipe_last := .SD[Month == Last_Month_to_compare, Pipe], by = "Gp"][
+   , `:=`(
+     Greater = Pipe_last < Pipe_this, Pipe_last = NULL, Pipe_this = NULL)][
+   Month != "Oct", Greater := NA]
> mydf
    Month Pipe Gp Greater
 1:   Aug    3  A      NA
 2:   Oct    4  A    TRUE
 3:   Aug    5  B      NA
 4:   Oct    3  B   FALSE
 5:   Sep    2  B      NA
 6:   Aug    1  C      NA
 7:   Oct    3  D      NA
 8:   Aug    4  E      NA
 9:   Oct    3  E   FALSE
10:   Aug    5  F      NA
11:   Oct   NA  F      NA

If you want, you can simplify the code to avoid two of the [.data.table calls above and to avoid defining Pipe_this.
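One possible way to do that simplification is to compute the comparison once per group inside a single := call, which also avoids joins and temporary columns. This is only a sketch and not necessarily what the answerer had in mind. It assumes each Gp has at most one row per month, it stores Month and Gp as character rather than factor, and the helper names this_val, last_val, cmp and out are made up for illustration.

```r
library(data.table)

# Sample data from the question (Month and Gp as character for brevity)
mydf <- data.table(
  Month = c("Aug", "Oct", "Aug", "Oct", "Sep", "Aug", "Oct", "Oct", "Aug", "Oct", "Aug"),
  Pipe  = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5),
  Gp    = c("A", "A", "B", "B", "B", "C", "D", "E", "E", "F", "F")
)
This_month_to_compare <- "Oct"
Last_Month_to_compare <- "Aug"

mydf[, Greater := {
  this_val <- Pipe[Month == This_month_to_compare]
  last_val <- Pipe[Month == Last_Month_to_compare]
  # Assumption: at most one row per month within a group;
  # if either month is absent, there is nothing to compare
  cmp <- if (length(this_val) == 1L && length(last_val) == 1L) this_val > last_val else NA
  # Only the row of the current month carries the result
  out <- rep(NA, .N)
  out[Month == This_month_to_compare] <- cmp
  out
}, by = Gp]

print(mydf)
```

Because := adds the column by reference, the rows stay in their original order and no intermediate table is created.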

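Answer 1: (score: 1)

This can be achieved with two joins. The first one selects the months to compare and puts them in the required order, after which the comparison can be made. The second join appends the result to the original data frame.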

library(data.table)
# Last_Month_to_compare, This_month_to_compare
months_to_compare <- c("Aug", "Oct")
mDT <- setDT(mydf)[
  # append row id column (to preserve original order)
  , rn := .I][
    # cross join of groups and months
    CJ(Gp = Gp, Month = months_to_compare, unique = TRUE), on = .(Gp, Month)][
      # groupwise comparison of the two months
      , Greater := Pipe > shift(Pipe), by = Gp][]
# appending result to original data frame by joining with intermediate result
mydf[mDT, on = .(rn), Greater := i.Greater][]
    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Sep    2  B  5      NA
 6:   Aug    1  C  6      NA
 7:   Oct    3  D  7      NA
 8:   Oct    3  E  8   FALSE
 9:   Aug    4  E  9      NA
10:   Oct   NA  F 10      NA
11:   Aug    5  F 11      NA

Note that the original order of mydf is preserved.

The intermediate result mDT looks like

    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Aug    1  C  6      NA
 6:   Oct   NA  C NA      NA
 7:   Aug   NA  D NA      NA
 8:   Oct    3  D  7      NA
 9:   Aug    4  E  9      NA
10:   Oct    3  E  8   FALSE
11:   Aug    5  F 11      NA
12:   Oct   NA  F 10      NA
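To see where the extra all-NA rows (C/Oct and D/Aug) come from, and why shift() compares against the earlier month, here is a toy sketch of the two building blocks. The vectors gp and pipe are made-up illustration data. Note that CJ() sorts its result by default, so this pattern relies on the earlier month also sorting first alphabetically, as "Aug" and "Oct" happen to do.

```r
library(data.table)

gp <- c("A", "B", "A")  # hypothetical group labels, including a duplicate

# Every Gp/Month combination exactly once, sorted by Gp and then Month.
# Joining the data to this grid yields NA rows for combinations
# that do not occur in the data.
grid <- CJ(Gp = gp, Month = c("Aug", "Oct"), unique = TRUE)
print(grid)

pipe <- c(3, 4)  # hypothetical Aug value followed by Oct value for one group

# shift() lags the vector by one position, so the first element
# has nothing to compare with and the second is compared to the first
cmp <- pipe > shift(pipe)
print(cmp)
```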

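Edit: Additional explanation

The OP asked for an explanation of the difference between mydf[mDT, on = .(rn)] and mydf[mDT, on = .(rn), Greater := i.Greater][].

With data.table, X[Y, on = ...] is a right outer join, equivalent to merge(X, Y, all.y = TRUE), i.e., all rows of Y are returned (see JOINing data in R using data.table). So,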

mydf[mDT, on = .(rn)]

returns

    Month Pipe Gp rn i.Month i.Pipe i.Gp Greater
 1:   Aug    3  A  1     Aug      3    A      NA
 2:   Oct    4  A  2     Oct      4    A    TRUE
 3:   Aug    5  B  3     Aug      5    B      NA
 4:   Oct    3  B  4     Oct      3    B   FALSE
 5:   Aug    1  C  6     Aug      1    C      NA
 6:    NA   NA NA NA     Oct     NA    C      NA
 7:    NA   NA NA NA     Aug     NA    D      NA
 8:   Oct    3  D  7     Oct      3    D      NA
 9:   Aug    4  E  9     Aug      4    E      NA
10:   Oct    3  E  8     Oct      3    E   FALSE
11:   Aug    5  F 11     Aug      5    F      NA
12:   Oct   NA  F 10     Oct     NA    F      NA

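The columns prefixed with i. come from mDT. Note that rows 6 and 7 have no matching rows in mydf. Also, the order of the rows is determined by the order in mDT.

If mydf and mDT are swapped,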

mDT[mydf, on = .(rn)][]

returns

    Month Pipe Gp rn Greater i.Month i.Pipe i.Gp
 1:   Aug    3  A  1      NA     Aug      3    A
 2:   Oct    4  A  2    TRUE     Oct      4    A
 3:   Aug    5  B  3      NA     Aug      5    B
 4:   Oct    3  B  4   FALSE     Oct      3    B
 5:    NA   NA NA  5      NA     Sep      2    B
 6:   Aug    1  C  6      NA     Aug      1    C
 7:   Oct    3  D  7      NA     Oct      3    D
 8:   Oct    3  E  8   FALSE     Oct      3    E
 9:   Aug    4  E  9      NA     Aug      4    E
10:   Oct   NA  F 10      NA     Oct     NA    F
11:   Aug    5  F 11      NA     Aug      5    F

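The columns prefixed with i. now come from mydf. Note that row 5 has no match in mDT. Also, the order of the rows is determined by mydf.

With the assignment operator :=, X[Y, on = ..., a := b] behaves like a left join: all rows of X are kept, in their original order. Thus,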
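The effect of swapping the two tables can be reproduced on a much smaller example. This is a minimal sketch with made-up tables X and Y; because their non-key columns have different names, no i. prefix appears in the result here.

```r
library(data.table)

X <- data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
Y <- data.table(id = c(2L, 4L), y = c("p", "q"))

# One row per row of Y, in Y's order; x is NA where X has no match
xy <- X[Y, on = .(id)]
print(xy)

# One row per row of X, in X's order; y is NA where Y has no match
yx <- Y[X, on = .(id)]
print(yx)
```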

mydf[mDT, on = .(rn), Greater := i.Greater][]

returns

    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Sep    2  B  5      NA
 6:   Aug    1  C  6      NA
 7:   Oct    3  D  7      NA
 8:   Oct    3  E  8   FALSE
 9:   Aug    4  E  9      NA
10:   Oct   NA  F 10      NA
11:   Aug    5  F 11      NA

where Greater is NA for the rows that have no match.
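The same behaviour can be checked on a toy example. This is a minimal sketch with made-up tables X and Y: the join with := modifies X in place, keeps its row count and order, and ignores rows of Y that have no partner in X.

```r
library(data.table)

X <- data.table(id = c(3L, 1L, 2L), x = c("c", "a", "b"))
Y <- data.table(id = c(2L, 4L), y = c("p", "q"))

# Add column y to X by reference; i.y refers to the y column of Y.
# id 4 exists only in Y and is therefore dropped.
X[Y, on = .(id), y := i.y]
print(X)
```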