How can I check whether events occur in the correct order?

Asked: 2018-07-24 10:59:54

Tags: r date-comparison

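I have a data table in which each column represents an event: if the event occurred, the cell holds a date; if not, it is empty. All events are optional, but when they occur they must follow a fixed order (A, then B, then C, ...).

While exploring the data I found at least two data-quality problems: either event A is empty while event B has a date, or event A has a later date than event B. I have to check 10 columns across more than 1,000 rows, so I'd like to know whether this can be automated in R (I only need to flag whether the order is correct, and will then review the bad cases by hand). The only thing I could think of was a long chain of nested ifelse statements, which doesn't seem appropriate at all.

Does anyone know a better function or approach? Thanks in advance. Here is some dummy data (consecutive events may share the same date):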

> dput(Book1)
structure(list(ID = 1:20, A = structure(c(17532, NA, NA, 17226, 
17498, 17204, 17646, 17567, 17609, 17259, 17606, 17606, 17567, 
17612, 17612, 17612, 17395, 17687, 17612, 17687), class = "Date"), 
B = structure(c(17567, 17716, NA, 17259, 17562, NA, 17651, 
17606, 17612, 17226, NA, 17681, NA, NA, NA, NA, 17407, 17687, 
NA, 17716), class = "Date"), C = structure(c(NA, NA, NA, 
17260, NA, NA, NA, NA, 17614, NA, NA, 17687, NA, 17687, NA, 
NA, NA, NA, NA, 17716), class = "Date"), D = structure(c(NA, 
NA, NA, 17407, NA, NA, NA, NA, 17625, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), class = "Date"), E = structure(c(NA, 
NA, NA, 17606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), class = "Date")), .Names = c("ID", "A", 
"B", "C", "D", "E"), row.names = c(NA, -20L), spec = structure(list(
cols = structure(list(ID = structure(list(), class = c("collector_integer", 
"collector")), A = structure(list(), class = c("collector_character", 
"collector")), B = structure(list(), class = c("collector_character", 
"collector")), C = structure(list(), class = c("collector_character", 
"collector")), D = structure(list(), class = c("collector_character", 
"collector")), E = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("ID", "A", "B", "C", "D", "E")), 
default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = 
 c("tbl_df", 
"tbl", "data.frame"))

So in this example, rows 2, 10 and 14 should be flagged.

Thanks in advance

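2 answers:

Answer 0 (score: 2)

You can use apply() to check each row in turn, and sapply() to check each element within that row.

Assuming your data frame is called test_data, we add a new column that shows whether the date columns in each row make sense according to the rules you specified.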

test_data$valid <- apply(test_data[2:ncol(test_data)], 1, function (x) {

  # apply() hands each row over as a character vector; dates formatted as
  # YYYY-MM-DD still compare correctly as strings

  # sapply iterates over each element in the row after the first one, checking 
  # all the previous elements
  valid <- sapply(2:length(x), function (y) {
    # seq_len(y - 1) avoids the precedence trap in 1:y-1, which is (1:y) - 1
    previous <- x[seq_len(y - 1)]
    # we can only check an element if it is a date
    if (is.na(x[y])) return(TRUE)
    # if any of the elements before the current one are NA, this is a 
    # problem; if any of the dates before the current one are greater than 
    # the current one, this is also a problem
    !(anyNA(previous) || max(previous) > x[y])
  })

  # if any of the elements in `valid` are false, this says there is a problem in
  # the data (note `valid` is shorter than `x` by one element because the first
  # element isn't checked against itself)
  all(valid)

})

# show only the rows that need a manual check
test_data[!test_data$valid, ]
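
As a quick sanity check of the same rule, here is a minimal sketch on a made-up three-row table. The names toy and date_cols and all values are hypothetical; it loops over row numbers with vapply() instead of apply(), which is a substitution for illustration, not the approach used above.

```r
# Hypothetical toy data: row 1 is fine, row 2 has A missing but B present,
# row 3 has B earlier than A
toy <- data.frame(
  ID = 1:3,
  A  = as.Date(c("2018-01-01", NA, "2017-04-03")),
  B  = as.Date(c("2018-02-05", "2018-07-04", "2017-03-01")),
  C  = as.Date(c(NA, NA, NA))
)

date_cols <- c("A", "B", "C")
toy$valid <- vapply(seq_len(nrow(toy)), function(i) {
  x <- unlist(toy[i, date_cols])   # numeric day counts, NA preserved
  seen <- which(!is.na(x))
  # valid if the non-missing values form a gap-free prefix and never decrease
  length(seen) == 0 || (max(seen) == length(seen) && !is.unsorted(x[seen]))
}, logical(1))

toy[!toy$valid, "ID"]   # IDs 2 and 3 need a manual check
```

Because is.unsorted() is non-strict by default, two consecutive events on the same date are treated as valid, which matches the requirement in the question.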

Answer 1 (score: 1)

I would do this in data.table, but I'm sure the dplyr version is similar:

library(data.table)
setDT(DF) # <- convert to data.table
DF[DF[ , melt(.SD, id.vars = 'ID')
       ][ , {
         non_na_idx = which(!is.na(value))
         any(diff(value) < 0, na.rm = TRUE) || 
           (length(non_na_idx) && 
              max(non_na_idx) != length(non_na_idx))
       }, keyby = ID],
   flag := i.V1, on = 'ID'][]
#     ID          A          B          C          D          E  flag
#  1:  1 2018-01-01 2018-02-05       <NA>       <NA>       <NA> FALSE
#  2:  2       <NA> 2018-07-04       <NA>       <NA>       <NA>  TRUE
#  3:  3       <NA>       <NA>       <NA>       <NA>       <NA> FALSE
#  4:  4 2017-03-01 2017-04-03 2017-04-04 2017-08-29 2018-03-16 FALSE
#  5:  5 2017-11-28 2018-01-31       <NA>       <NA>       <NA> FALSE
#  6:  6 2017-02-07       <NA>       <NA>       <NA>       <NA> FALSE
#  7:  7 2018-04-25 2018-04-30       <NA>       <NA>       <NA> FALSE
#  8:  8 2018-02-05 2018-03-16       <NA>       <NA>       <NA> FALSE
#  9:  9 2018-03-19 2018-03-22 2018-03-24 2018-04-04       <NA> FALSE
# 10: 10 2017-04-03 2017-03-01       <NA>       <NA>       <NA>  TRUE
# 11: 11 2018-03-16       <NA>       <NA>       <NA>       <NA> FALSE
# 12: 12 2018-03-16 2018-05-30 2018-06-05       <NA>       <NA> FALSE
# 13: 13 2018-02-05       <NA>       <NA>       <NA>       <NA> FALSE
# 14: 14 2018-03-22       <NA> 2018-06-05       <NA>       <NA>  TRUE
# 15: 15 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
# 16: 16 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
# 17: 17 2017-08-17 2017-08-29       <NA>       <NA>       <NA> FALSE
# 18: 18 2018-06-05 2018-06-05       <NA>       <NA>       <NA> FALSE
# 19: 19 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
# 20: 20 2018-06-05 2018-07-04 2018-07-04       <NA>       <NA> FALSE

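An apply-style answer coerces your table to a matrix, which can have some unexpected side effects (and will be slow for larger examples), so I chose to reshape your data instead. I think the problem is much simpler to solve with the data in long form.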
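
To make the coercion concrete, here is a small sketch on a hypothetical two-row data frame (df and its values are invented for illustration). It calls as.matrix() directly, which is the conversion apply() performs on a data frame.

```r
# Hypothetical two-row table with Date columns
df <- data.frame(
  A = as.Date(c("2018-01-01", NA)),
  B = as.Date(c("2018-02-05", "2018-07-04"))
)

m <- as.matrix(df)   # the same conversion apply() performs on a data frame
typeof(m)            # "character": the Date class is gone
is.na(m[2, "A"])     # TRUE: missing values stay missing

# Strings in YYYY-MM-DD format happen to sort like dates, so comparisons
# still work, but date arithmetic such as subtraction no longer does
m[1, "A"] < m[1, "B"]
```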

The reshaping is done with melt:

DF[ , melt(.SD, id.vars = 'ID')]
#      ID variable      value
#   1:  1        A 2018-01-01
#   2:  2        A       <NA>
#   3:  3        A       <NA>
#   4:  4        A 2017-03-01
#   5:  5        A 2017-11-28
#   6:  6        A 2017-02-07
#   7:  7        A 2018-04-25
#   8:  8        A 2018-02-05
#   9:  9        A 2018-03-19
#  10: 10        A 2017-04-03
# < more rows here >
#  91: 11        E       <NA>
#  92: 12        E       <NA>
#  93: 13        E       <NA>
#  94: 14        E       <NA>
#  95: 15        E       <NA>
#  96: 16        E       <NA>
#  97: 17        E       <NA>
#  98: 18        E       <NA>
#  99: 19        E       <NA>
# 100: 20        E       <NA>
#      ID variable      value

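You are looking for two conditions:

In any row, the date in a later column (in alphabetical order) should never come before the date in an earlier column. In the long form of the data, this means the values within each ID should be non-decreasing, or equivalently, diff(value) should never be negative. So our flag is TRUE if any(diff(na.omit(value)) < 0, na.rm = TRUE), meaning at least one such difference is negative for this ID: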

DF[ , melt(.SD, id.vars = 'ID')
    ][ , any(diff(na.omit(value)) < 0, na.rm = TRUE), 
       keyby = ID]
#     ID    V1
#  1:  1 FALSE
# < omitted; all FALSE >
#  9:  9 FALSE
# 10: 10  TRUE # <- column B comes before column A
# 11: 11 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
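
The ordering check can be tried on a single vector outside data.table. This is a minimal sketch; ok_row, bad_row and the helper name out_of_order are hypothetical and only stand in for the value column of one ID.

```r
# Hypothetical event dates for a single ID, in column order A..E
ok_row  <- as.Date(c("2018-03-16", "2018-05-30", "2018-06-05", NA, NA))
bad_row <- as.Date(c("2017-04-03", "2017-03-01", NA, NA, NA))

# diff() on Dates returns the day differences between neighbours; a negative
# entry means a later column holds an earlier date
out_of_order <- function(value) any(diff(na.omit(value)) < 0)

out_of_order(ok_row)   # FALSE
out_of_order(bad_row)  # TRUE

# equal dates give a difference of 0, so ties are not flagged
out_of_order(as.Date(c("2018-06-05", "2018-06-05")))  # FALSE
```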

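Once a column is "missing", every later column should "stay missing", i.e. there should be no NA gap between observations. A row violates this when (a) it has at least one non-missing value and (b) the number of non-missing elements differs from the position of the last non-missing column: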

DF[ , melt(.SD, id.vars = 'ID')
    ][ , {
      non_na_idx = which(!is.na(value))
      length(non_na_idx) && max(non_na_idx) != length(non_na_idx)
    }, keyby = ID]
#     ID    V1
#  1:  1 FALSE
#  2:  2  TRUE # <- Column A missing, B not
#  3:  3 FALSE
# < omitted; all FALSE >
# 13: 13 FALSE
# 14: 14  TRUE # <- Column B missing, C not
# 15: 15 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
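
The gap check is easier to see on a few hand-made vectors. The helper name has_gap and the dates below are assumptions for illustration; the body mirrors the expression used inside the grouped call.

```r
# TRUE when a non-missing value appears after a missing one
has_gap <- function(value) {
  non_na_idx <- which(!is.na(value))
  # no gap means the non-missing positions are exactly 1, 2, ..., n
  length(non_na_idx) > 0 && max(non_na_idx) != length(non_na_idx)
}

has_gap(as.Date(c("2018-03-22", NA, "2018-06-05", NA, NA)))  # TRUE: B missing, C present
has_gap(as.Date(c(NA, "2018-07-04", NA, NA, NA)))            # TRUE: A missing, B present
has_gap(as.Date(c("2018-03-22", "2018-03-22", NA, NA, NA)))  # FALSE: NAs only at the end
has_gap(rep(as.Date(NA), 5))                                 # FALSE: no events at all
```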

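Combining the two conditions with || flags all three problem rows (2, 10 and 14).

Finally, we join the newly created flag back to the original table as a column named flag. This can be split into two steps: build the per-ID flag table, then join: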

DF_with_flag = 
  DF[ , melt(.SD, id.vars = 'ID')
      ][ , {
        non_na_idx = which(!is.na(value))
        any(diff(na.omit(value)) < 0, na.rm = TRUE) || 
          (length(non_na_idx) && 
             max(non_na_idx) != length(non_na_idx))
      }, keyby = ID]
DF[DF_with_flag, flag := i.V1, on = 'ID']
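
For readers without data.table, the same two-step idea (compute one flag per ID, then attach it by ID) can be sketched in base R with match(). The tables events and flags below are hypothetical stand-ins for DF and DF_with_flag.

```r
# Hypothetical original table and a separately computed per-ID flag table
events <- data.frame(ID = c(1L, 2L, 3L))
flags  <- data.frame(ID = c(3L, 1L, 2L), V1 = c(FALSE, FALSE, TRUE))

# match() finds, for every ID in events, the row of flags with the same ID,
# so the result is correct even when the two tables are ordered differently
events$flag <- flags$V1[match(events$ID, flags$ID)]

events[events$flag, ]   # rows to review by hand
```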