Match and extract only the rows with at least one value outside a specified range

Time: 2018-04-25 09:31:21

Tags: r date dplyr subset diff

My data frame has several date columns.

    Index Measurement       Date Measure.1     Date.1 Measure.2     Date.2 Measure.3     Date.3
1       1        56.0 2018-03-16         2 2018-03-23        12 2018-03-29      22.0 2018-04-05
2       2        56.0 2018-03-16        78 2018-03-23     41234 2018-03-29      12.0 2018-04-05
3      12        65.0       <NA>        54 2018-03-23        35       <NA>     323.0 2018-04-05
4      15       129.1 2018-03-16        78 2018-03-23        12 2018-03-29       2.0 2018-04-05
5      22        56.0 2018-03-16       786 2018-03-23       234 2018-03-29      21.0       <NA>
6     567          NA 2018-03-16        34 2018-03-23         4 2018-03-29     545.0 2018-04-21
7      75         5.0 2018-03-16        52 2018-03-23         3 2018-03-29       5.0 2018-04-05
8     563        12.0 2018-03-16        43 2018-03-23        34 2018-03-29       5.0 2018-04-05
9     436        12.0 2018-03-16         3 2018-03-23       123 2018-03-29     213.0 2018-04-05
10  34533        56.0 2018-03-16        43 2018-03-23        32 2018-03-29       5.0 2018-04-25
11 234234        76.0 2018-03-16       234 2018-03-31       324 2018-05-06       5.0 2018-04-05
12   6643        76.0 2018-03-16        23 2018-03-23       123 2018-03-29       0.2 2018-04-11

Here is the code to load my data (a small sample):

df <- structure(list(Index = c(1L, 2L, 12L, 15L, 22L, 567L, 75L, 563L, 
436L, 34533L, 234234L, 6643L), Measurement = c(56, 56, 65, 129.1, 
56, NA, 5, 12, 12, 56, 76, 76), Date = structure(c(17606, 17606, 
NA, 17606, 17606, 17606, 17606, 17606, 17606, 17606, 17606, 17606
), class = "Date"), Measure.1 = c(2L, 78L, 54L, 78L, 786L, 34L, 
52L, 43L, 3L, 43L, 234L, 23L), Date.1 = structure(c(17613, 17613, 
17613, 17613, 17613, 17613, 17613, 17613, 17613, 17613, 17621, 
17613), class = "Date"), Measure.2 = c(12L, 41234L, 35L, 12L, 
234L, 4L, 3L, 34L, 123L, 32L, 324L, 123L), Date.2 = structure(c(17619, 
17619, NA, 17619, 17619, 17619, 17619, 17619, 17619, 17619, 17657, 
17619), class = "Date"), Measure.3 = c(22, 12, 323, 2, 21, 545, 
5, 5, 213, 5, 5, 0.2), Date.3 = structure(c(17626, 17626, 17626, 
17626, NA, 17642, 17626, 17626, 17626, 17646, 17626, 17632), class = "Date")), .Names = c("Index", 
"Measurement", "Date", "Measure.1", "Date.1", "Measure.2", "Date.2", 
"Measure.3", "Date.3"), row.names = c(NA, -12L), class = "data.frame")

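I need to look row-wise across adjacent Date columns: the difference between each pair of adjacent date cells should be no more than 9 days and no fewer than 3 days.

I can compute the differences as follows: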

diffdate_table <- df[ , grep( "Date" , names( df ) ) ] %>% rowwise() %>% diff.Date

The output of the code above is:

> diffdate_table 
    Date.1  Date.2   Date.3
1   7 days  6 days   7 days
2   7 days  6 days   7 days
3  NA days NA days  NA days
4   7 days  6 days   7 days
5   7 days  6 days  NA days
6   7 days  6 days  23 days
7   7 days  6 days   7 days
8   7 days  6 days   7 days
9   7 days  6 days   7 days
10  7 days  6 days  27 days
11 15 days 36 days -31 days
12  7 days  6 days  13 days

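Question

How can I get the Index (a column in the dataset above) of the rows in diffdate_table that have at least one difference greater than 9 days or less than 3 days?

1 answer:

Answer 0: (score: 0)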

Interesting question, plus I had never seen diff.Date before. Here are two tidyverse approaches. Both give the same result; the only difference is whether you work with the data in wide or long shape to compute the differences.

The first version follows the way you set things up, but I had to take a few odd steps to make sure Index was not dropped. There may be a better way to do that.

The second reshapes the data to long form from the start, using gather, and then simply subtracts the lag of each date.

Both then group by Index and count the differences that meet your condition. Hope that helps!

library(tidyverse)

# Version 1: wide shape. Index is moved to the row names so it survives diff.Date().
diffs1 <- df %>%
  column_to_rownames("Index") %>%
  select_at(vars(starts_with("Date"))) %>%
  diff.Date() %>%
  rownames_to_column("Index") %>%
  mutate(Index = as.integer(Index)) %>%
  gather(key = date, value = diff, -Index) %>%
  filter(diff %>% between(3, 9)) %>%
  count(Index) %>%
  ungroup() %>%
  arrange(Index)

diffs1
#> # A tibble: 10 x 2
#>     Index     n
#>     <int> <int>
#>  1      1     3
#>  2      2     3
#>  3     15     3
#>  4     22     2
#>  5     75     3
#>  6    436     3
#>  7    563     3
#>  8    567     2
#>  9   6643     2
#> 10  34533     2

diffs1$Index
#>  [1]     1     2    15    22    75   436   563   567  6643 34533

# Version 2: long shape. Each date is compared with the previous one per Index.
diffs2 <- df %>%
  select_at(vars(Index, starts_with("Date"))) %>%
  gather(key = obs, value = date, -Index) %>%
  group_by(Index) %>%
  mutate(prev_date = lag(date)) %>%
  mutate(diff = date - prev_date) %>%
  filter(!is.na(diff)) %>%
  filter(diff %>% between(3, 9)) %>%
  summarise(n = n())

Created on 2018-04-25 by the reprex package (v0.2.0).

Edit: I just realized you may not need the count of how many times each Index has a difference within your range, only the Index values that meet the condition. In that case, you can stop after filter(diff %>% between(3, 9)) and take the unique Index values.
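For comparison, here is a minimal base R sketch of the same idea that needs no packages. It is not the method used in the answer above. It rebuilds only five rows of the question's data (Index 1, 12, 22, 567 and 234234), and the helper names date_cols, days, gaps, idx_in and idx_out are made up for illustration. It returns both the Index values with at least one gap inside 3 to 9 days and, as the question asks, those with at least one gap outside that range. Treating gaps that involve a missing date as neither in nor out of range is an assumption of this sketch.

```r
# Illustrative subset of the question's data (rows 1, 3, 5, 6 and 11).
df <- data.frame(
  Index  = c(1L, 12L, 22L, 567L, 234234L),
  Date   = as.Date(c("2018-03-16", NA, "2018-03-16", "2018-03-16", "2018-03-16")),
  Date.1 = as.Date(c("2018-03-23", "2018-03-23", "2018-03-23", "2018-03-23", "2018-03-31")),
  Date.2 = as.Date(c("2018-03-29", NA, "2018-03-29", "2018-03-29", "2018-05-06")),
  Date.3 = as.Date(c("2018-04-05", "2018-04-05", NA, "2018-04-21", "2018-04-05"))
)

# Names of the date columns, in their left-to-right order.
date_cols <- grep("^Date", names(df), value = TRUE)

# Convert each Date column to a day count, giving a numeric matrix.
days <- sapply(df[date_cols], as.numeric)

# Gap in days between each pair of adjacent date columns, row by row.
gaps <- days[, -1, drop = FALSE] - days[, -ncol(days), drop = FALSE]

# Logical matrices; entries are NA wherever one of the two dates is missing.
in_range  <- gaps >= 3 & gaps <= 9
out_range <- gaps < 3 | gaps > 9

# Index values with at least one gap inside the 3 to 9 day range.
idx_in <- df$Index[rowSums(in_range, na.rm = TRUE) > 0]

# Index values with at least one gap outside the range (missing gaps ignored).
idx_out <- df$Index[rowSums(out_range, na.rm = TRUE) > 0]

print(idx_in)
print(idx_out)
```

On this subset, idx_in contains 1, 22 and 567, while idx_out contains 567 and 234234. Index 12 appears in neither because all of its gaps involve a missing date.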