How to determine changes in a data frame?

Asked: 2017-03-14 12:54:31

Tags: r

I want to record changes in a data.frame:

In the original data frame I have the following structure:

library(dplyr)
library(compare)

origianl <- data_frame(name = c('John','Tim','Allan','Mitch'),
                       A = c(0,1,1,0),
                       B = c(1,0,0,1),
                       C = c(0,0,0,0),
                       D = c(1,0,0,1))

#> data.frame(origianl)

   name           A           B           C           D
1  John           0           1           0           1
2   Tim           1           0           0           0
3 Allan           1           0           0           0
4 Mitch           0           1           0           1

Here John plays for teams B and D.

In the changed data frame I have the following structure:

changed <- data_frame(name = c('John','Tim','Allan','Mitch'),
                      A = c(1,0,1,0),
                      B = c(0,0,0,0),
                      C = c(0,1,0,1),
                      D = c(1,0,0,1))

#> data.frame(changed)
   name           A           B           C           D
1  John           1           0           0           1
2   Tim           0           0           1           0
3 Allan           1           0           0           0
4 Mitch           0           0           1           1

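For example, John changed teams and now plays for A and D.

There are a few constraints:

I check original every day. Players can play for several teams or for just one, and they can switch teams. Every player is assigned to at least one team.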
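The "at least one team" constraint can be checked before any comparison is made. This is a minimal sketch, assuming every column except name is a 0/1 team indicator; team_counts is a name introduced here for illustration only.

```r
# Example data from the question, rebuilt as a plain data.frame
changed <- data.frame(name = c("John", "Tim", "Allan", "Mitch"),
                      A = c(1, 0, 1, 0),
                      B = c(0, 0, 0, 0),
                      C = c(0, 1, 0, 1),
                      D = c(1, 0, 0, 1),
                      stringsAsFactors = FALSE)

# Number of teams per player (assumption: all columns but the first are teams)
team_counts <- rowSums(changed[-1])

# Stop early if any player has no team at all
stopifnot(all(team_counts >= 1))
```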

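What I tried was to write a function that records the changes in two specific columns, from and when.

The desired output table could look like the following. It is basically changed, with two extra columns that record the change: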

#> data.frame(new_original)
   name           A           B           C           D  from    when
1  John           1           0           0           1   B     2017-01-01
2   Tim           0           0           1           0   A     2017-01-01
3 Allan           1           0           0           0   NA     NA
4 Mitch           0           0           1           1   B     2017-01-01

What I have tried so far:

my_func <- function(origianl, changed) {
# Get the difference
 difference <- anti_join(changed, origianl)
# find the changes in the original data
  differ <- origianl[origianl$name %in% difference$name, ]

# check whether condition is satisfied
if (nrow(differ) == nrow(difference) &&
  nrow(anti_join(changed, origianl)) > 0)

{
 new_original <-
  cbind(changed, from = names(which(
    compare(difference[2:ncol(difference)], differ[2:ncol(differ)], allowAll =
              TRUE)$detailedResult == FALSE, when = Sys.Date()
  )))

 }

 return(new_original)
}

new_original = my_func(origianl, changed)

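This produces the error "arguments imply differing number of rows: 4, 3" from the cbind command. I'm not sure how to get the desired output.

3 Answers:

Answer 0: (score: 4)

Here is a possible vectorized approach using base R. You haven't specified how the when column should be computed (is it just the current date?), but here is how from can be computed very easily: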

indx <- which((origianl[-1] - changed[-1]) == 1, arr.ind = TRUE)
changed[indx[, "row"], "from"] <- names(changed)[-1][indx[, "col"]]
changed
#    name A B C D from
# 1  John 1 0 0 1    B
# 2   Tim 0 0 1 0    A
# 3 Allan 1 0 0 0 <NA>
# 4 Mitch 0 0 1 1    B
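
The comparison works because a difference of 1 means the player had a 1 in origianl and a 0 in changed, i.e. they left that team. The sketch below extends the same idea in two ways that are assumptions rather than part of the answer: when is taken to be the current date from Sys.Date(), and a player who leaves several teams gets them pasted together with commas. The names left, teams and result are introduced here for illustration.

```r
# Example data from the question, rebuilt as plain data.frames
origianl <- data.frame(name = c("John", "Tim", "Allan", "Mitch"),
                       A = c(0, 1, 1, 0), B = c(1, 0, 0, 1),
                       C = c(0, 0, 0, 0), D = c(1, 0, 0, 1),
                       stringsAsFactors = FALSE)
changed <- data.frame(name = c("John", "Tim", "Allan", "Mitch"),
                      A = c(1, 0, 1, 0), B = c(0, 0, 0, 0),
                      C = c(0, 1, 0, 1), D = c(1, 0, 0, 1),
                      stringsAsFactors = FALSE)

teams <- names(changed)[-1]

# TRUE where a player was on a team before and is not any more
left <- (origianl[-1] - changed[-1]) == 1

# One string per player: every team they left, or NA if they left none
from <- apply(left, 1, function(r) {
  if (any(r)) paste(teams[r], collapse = ",") else NA_character_
})

result <- changed
result$from <- unname(from)

# Assumption: "when" is the date on which the check is run
result$when <- Sys.Date()
result$when[is.na(result$from)] <- NA
```

Like the answer above, this relies on both data frames listing the players in the same row order.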

Answer 1: (score: 1)

Assuming Sys.time() is chosen for when, then:

library(dplyr)

f1 <- function(x, y){
  d4 <-left_join(y, x, by = 'name')
  d5 <- d4[grepl('.x', names(d4), fixed = TRUE)]
  d6 <- d4[grepl('.y', names(d4), fixed = TRUE)]
  l1 <- apply(d5 - d6, 1, function(i) names(i)[i == -1])
  y$from <- sub('\\..*', '', unlist(ifelse(lengths(l1) == 0, NA, l1)))
  y$when <- Sys.time()
  y$when[is.na(y$from)] <- NA
  return(y)
}

f1(origianl, changed)

#   name A B C D from                when
#1  John 1 0 0 1    B 2017-03-14 15:37:59
#2   Tim 0 0 1 0    A 2017-03-14 15:37:59
#3 Allan 1 0 0 0 <NA>                <NA>
#4 Mitch 0 0 1 1    B 2017-03-14 15:37:59
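
Joining by name means the result does not depend on the two data frames listing players in the same order. The same alignment can be sketched in base R with match(). This is an illustration under stated assumptions rather than a replacement for f1: the rows of changed are deliberately shuffled here, players missing from origianl are treated as having left no team, and old, left and teams are names introduced for the example.

```r
origianl <- data.frame(name = c("John", "Tim", "Allan", "Mitch"),
                       A = c(0, 1, 1, 0), B = c(1, 0, 0, 1),
                       C = c(0, 0, 0, 0), D = c(1, 0, 0, 1),
                       stringsAsFactors = FALSE)

# Same players and values as in the question, but in a different row order
changed <- data.frame(name = c("Mitch", "Allan", "Tim", "John"),
                      A = c(0, 1, 0, 1), B = c(0, 0, 0, 0),
                      C = c(1, 0, 1, 0), D = c(1, 0, 0, 1),
                      stringsAsFactors = FALSE)

teams <- setdiff(names(changed), "name")

# Line up each row of changed with the same player's row in origianl
old <- origianl[match(changed$name, origianl$name), teams]

# TRUE where the player was on the team before (1) and is not now (0)
left <- as.matrix(old) == 1 & as.matrix(changed[teams]) == 0

# Players not found in origianl give NA rows; treat them as "left nothing"
left[is.na(left)] <- FALSE

changed$from <- unname(apply(left, 1, function(r) {
  if (any(r)) paste(teams[r], collapse = ",") else NA_character_
}))
```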

Answer 2: (score: 0)

I'm not sure how you got that, but this should work:

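origianl <- as.data.frame(origianl)
# Assumption: i is not defined in the answer as posted; it is taken here to be the column names
i <- names(origianl)
# For each row, keep the teams that are 1 in origianl but not 1 in changed
id <- sapply(1:nrow(origianl), function(x) i[which( origianl[x,]==1)][!(i[which( origianl[x,]==1)] %in% i[which(changed[x,]==1)])])
# Players who left no team get NA
origianl$from <- sapply(id,function(x) ifelse(length(x)==0,NA,x))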