Full join on id and latest date in a lagged window using sqldf

Asked: 2019-07-12 17:47:10

Tags: r sqldf

I want to join two datasets, A and B. I want to merge B's variables into A with an exact match on id, but keep only B's latest observation dated between three months and three years before the date in A.

The datasets are big enough that I need the sqldf package (about 500,000 rows in A and 250,000 in B). It seems the logic should be A LEFT OUTER JOIN B ON A.id = B.id AND (A.date - B.date) BETWEEN 3*30 AND 3*365, followed by ORDER BY B.date DESC and GROUP BY A.row, keeping the first observation. But my code below keeps the first observation overall, not the first observation within each A.row group.

I can do this in two steps (one in sqldf, one in the tidyverse), but can sqldf do both steps at once?

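(The reprex code block was lost here; the sketch below reconstructs the setup and the attempted one-step query from the description above and from the output in the second answer. The exact dates, the seed, and the shape of the failing query are all assumptions.)

library(sqldf)

# Hypothetical data consistent with the second answer's output: A has two
# subids per id, all dated 2019-01-01; B has ten half-yearly observations
# per id, so the fourth one (2018-05-01) is the latest that falls 3 months
# to 3 years before the A date. The x values are random and will not match
# the answer's output exactly.
set.seed(42)
A <- data.frame(id    = rep(1:10, each = 2),
                subid = rep(1:2, times = 10),
                date  = as.Date("2019-01-01"))
B <- data.frame(id   = rep(1:10, each = 10),
                date = rep(seq(as.Date("2016-11-01"), by = "6 months",
                               length.out = 10), times = 10),
                x    = runif(100))

# Attempted one-step query (shape assumed): ORDER BY runs only after GROUP BY
# has already collapsed each group to one arbitrary row, so this does not
# reliably keep the latest B observation per A row.
sqldf('SELECT A.rowid AS A_row, A.*, B.date AS B_date, B.x
       FROM A
       LEFT OUTER JOIN B
         ON A.id = B.id AND (A.date - B.date) BETWEEN 3*30 AND 3*365
       GROUP BY A.rowid
       ORDER BY B.date DESC')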

Created on 2019-07-12 by the reprex package (v0.3.0)

2 Answers:

Answer 0 (score: 3)

Consider a window function such as RANK(), which dplyr::row_number() likely adapts (among other SQL semantics like select, group_by, and case_when). SQLite (the default dialect of sqldf) added support for window functions only recently, in version 3.25.0 (released September 2018).
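Whether this works with sqldf's default backend therefore depends on which SQLite version your RSQLite build bundles; a quick way to check (sqlite_version() is a built-in SQLite function):

library(sqldf)

# Needs to report 3.25.0 or later for window functions to be available:
sqldf("SELECT sqlite_version() AS version")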

If that is not available in your sqldf installation (it depends on the version), use the Postgres backend via RPostgreSQL; see the author's docs. In time, RMySQL may become another supported backend, since MySQL 8 recently added support for window functions.
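For what it's worth, sqldf chooses its backend from the sqldf.driver option, falling back to whichever driver package (RPostgreSQL, RMySQL, RH2) happens to be loaded; a one-line way to make the choice explicit rather than relying on load order:

options(sqldf.driver = "PostgreSQL")  # set back to "SQLite" for the default backend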

library(RPostgreSQL)
library(sqldf)

# Needs a backend with window-function support: SQLite >= 3.25.0, or
# Postgres via RPostgreSQL as loaded above.
D <- sqldf('WITH cte AS
               (SELECT *,
                       RANK() OVER (PARTITION BY "A".row ORDER BY "B".date DESC) AS rn
                FROM "A"
                LEFT JOIN "B"
                    ON "A".id = "B".id
                   AND ("A".date - "B".date) BETWEEN 3*30 AND 3*365
               )

           -- rn = 1 is the latest in-window B observation for each A row
           SELECT * FROM cte
           WHERE rn = 1')
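For comparison, here is a rough dplyr sketch of the same ranking idea using the row_number() mentioned above. This is only an approximation under assumptions: the column names (id, date, x) follow the second answer's output, and since dplyr has no non-equi join, the lag window is applied as a filter after joining on id:

library(dplyr)

res <- A %>%
  mutate(A_row = row_number()) %>%        # stable key for each A row
  left_join(B, by = "id", suffix = c("_A", "_B")) %>%
  # apply the 3-month-to-3-year window; is.na() keeps A rows with
  # no match at all, roughly mirroring the LEFT JOIN:
  filter(is.na(date_B) |
           (as.numeric(date_A - date_B) >= 3*30 &
            as.numeric(date_A - date_B) <= 3*365)) %>%
  group_by(A_row) %>%
  arrange(desc(date_B), .by_group = TRUE) %>%
  filter(row_number() == 1) %>%           # latest in-window B observation per A row
  ungroup()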

Answer 1 (score: 1)

In SQLite, if you use max (or min) together with group by, the other selected columns are taken from the row where the extreme value occurs, so the whole row is used:

# max(B.rowid) picks the latest matching B row per A row; B's bare columns
# (B.date, B.x) come from that same row, per the SQLite behavior noted above.
sqldf('SELECT 
    A.rowid as A_row, 
    A.id, 
    A.subid, 
    A.date as A_date__Date, 
    max(B.rowid) as B_row, 
    B.date as B_date__Date, 
    B.x
  FROM A
  LEFT OUTER JOIN B ON A.id = B.id AND (A.date - B.date) BETWEEN 3*30 AND 3*365
  GROUP BY A.rowid
  ', method = "name__class")  # name__class: a "__Date" suffix converts that column to Date

giving:

   A_row id subid     A_date B_row     B_date         x
1      1  1     1 2019-01-01     4 2018-05-01 0.8304476
2      2  1     2 2019-01-01     4 2018-05-01 0.8304476
3      3  2     1 2019-01-01    14 2018-05-01 0.2554288
4      4  2     2 2019-01-01    14 2018-05-01 0.2554288
5      5  3     1 2019-01-01    24 2018-05-01 0.9466682
6      6  3     2 2019-01-01    24 2018-05-01 0.9466682
7      7  4     1 2019-01-01    34 2018-05-01 0.6851697
8      8  4     2 2019-01-01    34 2018-05-01 0.6851697
9      9  5     1 2019-01-01    44 2018-05-01 0.9735399
10    10  5     2 2019-01-01    44 2018-05-01 0.9735399
11    11  6     1 2019-01-01    54 2018-05-01 0.7846928
12    12  6     2 2019-01-01    54 2018-05-01 0.7846928
13    13  7     1 2019-01-01    64 2018-05-01 0.5664884
14    14  7     2 2019-01-01    64 2018-05-01 0.5664884
15    15  8     1 2019-01-01    74 2018-05-01 0.4793986
16    16  8     2 2019-01-01    74 2018-05-01 0.4793986
17    17  9     1 2019-01-01    84 2018-05-01 0.6456319
18    18  9     2 2019-01-01    84 2018-05-01 0.6456319
19    19 10     1 2019-01-01    94 2018-05-01 0.9330341
20    20 10     2 2019-01-01    94 2018-05-01 0.9330341