R dplyr: lagged date difference calculation never finishes

Time: 2015-06-29 22:26:26

Tags: r dplyr

I have a data frame, bp, with 3.4 million rows. The dplyr code in question is:

bp <- bp %>% group_by(accountId) %>%
  mutate(diff = as.numeric(date - lag(date)))

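The mutate call above is my attempt to calculate the lagged time difference with dplyr.

On my 8 GB MacBook, R crashes. On a 64 GB Linux server, the code runs forever. Any ideas on how to solve this?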

1 answer:

Answer 0: (score: 2)

I don't know what is going wrong on your end, but with date as a proper Date object, everything runs quickly here.

Recreating some data:

dat <- read.table(text="        date amount accountId type
1 2015-06-11  101.2         1    a
2 2015-06-18  101.2         1    a
3 2015-06-24  101.2         1    b
4 2015-06-11  294.0         2    a
5 2015-06-18   48.0         2    a
6 2015-06-26   10.0         2    b",header=TRUE)
dat$date <- as.Date(dat$date)
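Since the point above hinges on date being a real Date object, here is a minimal sketch of how to check and convert the column before running the grouped mutate. The raw data frame is a hypothetical stand-in with character dates, not the asker's data, and the date format string is an assumption.

```r
# Hypothetical stand-in: dates read in as plain character strings
raw <- data.frame(
  date = c("2015-06-11", "2015-06-18", "2015-06-24"),
  accountId = c(1, 1, 1),
  stringsAsFactors = FALSE
)
class(raw$date)   # character at this point

# Convert once, up front, so that later subtraction is Date arithmetic
# (the "%Y-%m-%d" format is assumed to match the stored strings)
raw$date <- as.Date(raw$date, format = "%Y-%m-%d")
class(raw$date)   # now Date

# Gaps between consecutive dates, in days, as plain numbers
day_gaps <- as.numeric(diff(raw$date))
day_gaps
```

If class() reports something other than Date on the real data, converting first is a cheap thing to rule out before looking elsewhere.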

Then running the calculation on 3.4 million rows in 1,000 groups:

library(dplyr)  # provides %>%, group_by, mutate and lag
set.seed(1)
dat2 <- dat[sample(rownames(dat),3.4e6,replace=TRUE),]
dat2$accountId <- sample(1:1000,3.4e6,replace=TRUE)
nrow(dat2)
#[1] 3400000
length(unique(dat2$accountId))
#[1] 1000

system.time({
dat2 <- dat2 %>% group_by(accountId) %>%
  mutate(diff = as.numeric(date - lag(date)))
})
#  user  system elapsed 
#  0.38    0.03    0.40 

head(dat2[dat2$accountId==46,])
#Source: local data frame [6 x 6]
#Groups: accountId
#
#        date amount accountId type diff
#1 2015-06-24  101.2        46    b   NA
#2 2015-06-18   48.0        46    a   -6
#3 2015-06-11  294.0        46    a  -13
#4 2015-06-18  101.2        46    a    7
#5 2015-06-26   10.0        46    b    2
#6 2015-06-11  294.0        46    a    0
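The negative values in the diff column arise because the sampled rows are not in date order within each account. If the goal is the gap since the previous date for the same account (an assumption about intent, not something stated in the thread), one option is to sort before taking the lag. The toy data frame below is hypothetical and only illustrates the ordering step.

```r
library(dplyr)

# Small hypothetical example, deliberately out of date order
toy <- data.frame(
  accountId = c(1, 1, 1, 2, 2),
  date = as.Date(c("2015-06-24", "2015-06-11", "2015-06-18",
                   "2015-06-26", "2015-06-11"))
)

sorted <- toy %>%
  arrange(accountId, date) %>%   # order rows by date within each account
  group_by(accountId) %>%
  mutate(diff = as.numeric(date - lag(date))) %>%  # days since previous row
  ungroup()

sorted$diff
```

The first row of each account has no predecessor, so its diff is NA; every other value is non-negative once the rows are sorted.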