Filter rows by time threshold

Time: 2018-02-27 20:35:42

Tags: r dataframe time filtering

I have a dataset organized like this:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P2   A             2015-03-16 19:34:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P6   A             2015-03-16 21:19:00
P7   A             2015-03-16 21:33:00
P8   B             2015-03-16 21:35:00
P9   B             2015-03-16 23:43:00

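For each species, I would like to select independent pictures (i.e. pictures separated from each other by at least 1 hour) from this dataset, using R.

In this example, for species A, I only want to keep P1, P3 and P4. P2 is not considered because it falls within the 1 h period that starts at P1. P3 is kept because its DateTime (19h58) falls after 19h42, the end of that period. The next 1 h period then starts at P3 and lasts until 20h58, so P4 (21h02) is kept as well. For species B, only P5 and P9 are kept.

So, after this filter, my dataset would look like this: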

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P9   B             2015-03-16 23:43:00

Does anyone know how to do this in R?

4 answers:

Answer 0: (score: 1)

There may be a more elegant way, but this works:

library(dplyr)

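isHourApart <- function(dt) {
    # Assumes dt is sorted in increasing order within each group.
    dt <- as.numeric(dt)            # seconds since the epoch
    last_kept <- -Inf               # time of the most recently kept row
    keeps <- logical(length(dt))    # preallocated; defaults to FALSE
    for (i in seq_along(dt)) {
        if (dt[i] >= last_kept + 60 * 60) {
            last_kept <- dt[i]
            keeps[i] <- TRUE
        }
    }
    keeps
}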


result <- df %>% 
    group_by(Species) %>% 
    filter(isHourApart(DateTime))

> result
# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime           
  <chr> <fct>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

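Note that the DateTime column is of class POSIXct, and that the rows are assumed to be sorted by DateTime within each species.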
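
The same greedy rule can also be applied without dplyr. The following is a minimal base-R sketch, not the answer author's code: the helper name keep_hour_apart, its gap parameter, and the UTC time zone are assumptions made for illustration. It sorts the rows first, so it does not rely on the input order.

```r
# Sample data from the question, with DateTime parsed as POSIXct (UTC assumed).
df <- data.frame(
  ID = paste0("P", 1:9),
  Species = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
  DateTime = as.POSIXct(
    c("2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00",
      "2015-03-16 21:02:00", "2015-03-16 21:18:00", "2015-03-16 21:19:00",
      "2015-03-16 21:33:00", "2015-03-16 21:35:00", "2015-03-16 23:43:00"),
    tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each time that is at least `gap` seconds
# after the last kept time. Expects `x` sorted in increasing order.
keep_hour_apart <- function(x, gap = 3600) {
  x <- as.numeric(x)
  keep <- logical(length(x))
  last_kept <- -Inf
  for (i in seq_along(x)) {
    if (x[i] - last_kept >= gap) {
      keep[i] <- TRUE
      last_kept <- x[i]
    }
  }
  keep
}

# Sort by species and time, then apply the helper within each species.
# ave() returns a numeric vector here, so the flags come back as 0/1.
df <- df[order(df$Species, df$DateTime), ]
keep <- ave(as.numeric(df$DateTime), df$Species, FUN = keep_hour_apart) == 1
result <- df[keep, ]
result$ID
```

On the question's data this should leave P1, P3 and P4 for species A and P5 and P9 for species B.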

Answer 1: (score: 1)

Here is a dplyr solution:

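library(dplyr)
df %>%
    arrange(Species, DateTime) %>%
    group_by(Species) %>%
    mutate(
        DateTime = as.POSIXct(DateTime),
        # minutes since the previous picture of the same species;
        # units are set explicitly so they cannot vary between groups
        diff = as.numeric(difftime(DateTime, lag(DateTime), units = "mins")),
        diff = ifelse(is.na(diff), 0, diff),
        # index of the 60-minute bin, counted from the first picture
        cumdiff = cumsum(diff) %/% 60,
        x = abs(lag(cumdiff) - cumdiff)) %>%
    # keep the first row of each group and every row that opens a new bin
    filter(is.na(x) | x > 0) %>%
    select(ID, Species, DateTime) %>%
    ungroup() %>%
    as.data.frame()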
#  ID Species            DateTime
#1 P1       A 2015-03-16 18:42:00
#2 P3       A 2015-03-16 19:58:00
#3 P4       A 2015-03-16 21:02:00
#4 P5       B 2015-03-16 21:18:00
#5 P9       B 2015-03-16 23:43:00

Sample data

df <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P2   A             '2015-03-16 19:34:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P6   A             '2015-03-16 21:19:00'
P7   A             '2015-03-16 21:33:00'
P8   B             '2015-03-16 21:35:00'
P9   B             '2015-03-16 23:43:00'", header = TRUE)
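
One point worth checking with this approach: the cumsum(...) %/% 60 step assigns each picture to a fixed 60-minute bin counted from the first picture of the species, whereas the question describes a window that restarts at each kept picture. The two agree on the sample data but can differ elsewhere. The sketch below uses three hypothetical timestamps (not from the question, UTC assumed) to show one such case in base R.

```r
# Hypothetical timestamps for a single species (UTC assumed).
times <- as.POSIXct(
  c("2015-03-16 18:42:00", "2015-03-16 19:50:00", "2015-03-16 20:45:00"),
  tz = "UTC")

# Fixed 60-minute bins counted from the first picture (the cumsum idea):
# keep the first picture that falls into each bin.
mins_since_first <- as.numeric(difftime(times, times[1], units = "mins"))
bin <- mins_since_first %/% 60
keep_fixed <- !duplicated(bin)

# Rolling rule from the question: the window restarts at each kept picture.
keep_rolling <- logical(length(times))
last_kept <- -Inf
for (i in seq_along(times)) {
  if (mins_since_first[i] - last_kept >= 60) {
    keep_rolling[i] <- TRUE
    last_kept <- mins_since_first[i]
  }
}

keep_fixed    # TRUE TRUE TRUE
keep_rolling  # TRUE TRUE FALSE
```

Here the third picture is only 55 minutes after the second, so the rolling rule drops it, while the fixed bins keep it because it opens a new bin.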

Answer 2: (score: 1)

Here is one way to do this using data.table

Answer 3: (score: 0)

We can simply create a new column of 60-minute intervals and then keep the first occurrence of each Species within each interval.

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1)

Output 1

# A tibble: 5 x 4
# Groups:   Species, by60 [5]
  ID    Species DateTime            by60               
  <chr> <chr>   <dttm>              <fct>              
1 P1    A       2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4    A       2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5    B       2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9    B       2015-03-16 23:43:00 2015-03-16 23:42:00

If we want to remove the dummy column:

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>% 
  ungroup() %>% 
  select(-by60)

Output 2

# A tibble: 5 x 3
  ID    Species DateTime           
  <chr> <chr>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00
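
To see what the by60 column contains, it can help to look at the bins that cut() builds. As far as I can tell, because mutate() runs before group_by(), the 60-minute breaks start at the earliest timestamp in the whole column rather than at the first picture of each species. The base-R sketch below uses hypothetical timestamps (UTC assumed) in which two pictures of species B are 32 minutes apart yet fall into different bins, so both would be kept.

```r
# Hypothetical data: one picture of species A, two of species B (UTC assumed).
times <- as.POSIXct(
  c("2015-03-16 18:42:00", "2015-03-16 21:18:00", "2015-03-16 21:50:00"),
  tz = "UTC")
species <- c("A", "B", "B")

# 60-minute bins; breaks start at the earliest timestamp (18:42).
by60 <- cut(times, "60 min")
bin_index <- as.integer(by60)

# First occurrence of each (species, bin) pair, as slice(1) would keep.
keep <- !duplicated(data.frame(species, bin_index))

bin_index
keep
```

The two species-B pictures land on either side of the 21:42 break, which is why this method can keep pictures that are less than an hour apart.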