Question

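My dataset contains user, time and condition. I want to replace the time of every sequence that starts with a FALSE followed by two or more consecutive TRUE with the time of the last consecutive TRUE.

Let's say df:

df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)

My desired outcome: the time of row number 6 is copied to the times of row numbers 3 to 6, because there are consecutive TRUE from 4 to 6. The same applies to the last three records.

How can I do this in R?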

Answer 1

Here is one option using rle:

## Run-length encoding of the condition column
df_rle <- rle(df$condition)
## Runs of 2 or more consecutive TRUEs in the RLE
seq_changes <- which(df_rle$lengths >= 2 & df_rle$values)
## End-point index of each run in the original data frame
df_ind <- cumsum(df_rle$lengths)

## Loop over the qualifying TRUE runs
for (i in seq_changes) {
  if (i == 1) next       # data starts with a TRUE run: no preceding FALSE
  i1 <- df_ind[i - 1]    # last FALSE before the run
  i2 <- df_ind[i]        # last TRUE of the run
  df$time[i1:i2] <- df$time[i2]
}
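To see why the indices line up, it can help to inspect the run-length encoding for the example data. The following is only an illustrative sketch: it rebuilds the condition column by hand, and the names `run_ends` and `true_runs` are introduced here, not taken from the answer.

```r
# Condition column of the example data, typed out by hand
condition <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, TRUE, TRUE)

df_rle <- rle(condition)
df_rle$lengths   # 1 1 1 3 3 2
df_rle$values    # FALSE TRUE FALSE TRUE FALSE TRUE

# Row index at which each run ends
run_ends <- cumsum(df_rle$lengths)                        # 1 2 3 6 9 11

# Runs that are TRUE and at least 2 long
true_runs <- which(df_rle$lengths >= 2 & df_rle$values)   # 4 6
```

Run 4 ends at row 6 and the run before it ends at row 3, so rows 3 to 6 are overwritten; run 6 ends at row 11 and the run before it ends at row 9, so rows 9 to 11 are overwritten.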

Answer 2

This solution should do the trick; see the comments in the code for more details.

false_positions <- which(!c(df$condition, FALSE)) #Flag the position of each of the FALSE occurrences
                                                  #A dummy FALSE is put on the end to check for end of dataframe

false_differences <- diff(false_positions, 1)     #Calculate how far each FALSE occurrence is from the last

false_starts <- which(false_differences > 2)      #Grab which of these FALSE differences are more than 2 apart
                                                  #Greater than 2 indicates 2 or more TRUEs as the first FALSE 
                                                  #counts as one position

#false_starts stores the beginning of each chain we want to update

#Go through each of the FALSE starts which have more than one consecutive TRUE
for(false_start in false_starts){

  false_first <- false_positions[false_start]     #Gets the position of the start of our chain

  true_last <- false_positions[false_start+1]-1   #Gets the position of the end of our chain, which is
                                                  #the item before (thus the -1) the FALSE after our
                                                  #initial FALSE (thus the +1)

  time_override <- df$time[true_last]             #Now we know the position of the end of our chain (the last TRUE)
                                                  #We can get the time we want to use

  df$time[false_first:true_last] <- time_override #Update all the times from the start to end of our chain with
                                                  #the time we just determined

}

> df
   user time condition
1    11 1:05     FALSE
2    11 1:10      TRUE
3    11 1:25     FALSE
4    11 1:25      TRUE
5    11 1:25      TRUE
6    11 1:25      TRUE
7    11 1:40     FALSE
8    22 2:20     FALSE
9    22 2:40     FALSE
10   22 2:40      TRUE
11   22 2:40      TRUE

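I would have liked to parallelise the bottom loop if possible, but off the top of my head I'm struggling to do that.

The gist is to identify where all our FALSEs are and then identify the start of each of our chains. Since we only have TRUE and FALSE, we can do this by looking at how far apart our FALSEs are!

Once we know where our chains start (each one is the first FALSE of a pair of FALSEs that are far enough apart), we can get the end of the chain by looking at the element just before the next FALSE in the list of all FALSEs we already created.

Now that we have the start and end of our chains, we can look at the end of each chain to get the time we want, and then fill in the time values!

I hope this offers a relatively quick way of doing what you want, though :)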
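The intermediate values for the example data may make these steps easier to follow. This is only a sketch: the condition column is typed out by hand, and `chain_start` and `chain_end` are illustrative names that do not appear in the answer's code.

```r
# Condition column of the example data, typed out by hand
condition <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, TRUE, TRUE)

# Positions of every FALSE, plus a dummy FALSE after the last row
false_positions <- which(!c(condition, FALSE))        # 1 3 7 8 9 12

# Gap between each FALSE and the next one
false_differences <- diff(false_positions)            # 2 4 1 1 3

# A gap above 2 means at least two TRUEs sit in between
false_starts <- which(false_differences > 2)          # 2 5

# First and last row of each chain to overwrite
chain_start <- false_positions[false_starts]          # 3 9
chain_end   <- false_positions[false_starts + 1] - 1  # 6 11
```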
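As for getting rid of the loop, one possibility is to build all the row indices at once. The following is a minimal sketch of such a vectorised variant, not part of either answer: it assumes the example data from the question, reads `time` as character via `stringsAsFactors = FALSE`, ignores the `user` column just as both answers do, and uses illustrative names (`r`, `hit`, `from`, `to`, `n`, `idx`).

```r
df <- read.csv(text = "user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header = TRUE, stringsAsFactors = FALSE)

r      <- rle(df$condition)
ends   <- cumsum(r$lengths)       # last row of each run
starts <- ends - r$lengths + 1    # first row of each run

# TRUE runs of length >= 2 that have a FALSE run in front of them
hit <- which(r$values & r$lengths >= 2 & seq_along(r$lengths) > 1)

from <- starts[hit] - 1           # the FALSE just before each run
to   <- ends[hit]                 # the last TRUE of each run
n    <- to - from + 1             # number of rows to overwrite per chain

# All rows to overwrite, built without an explicit loop
idx <- rep(from, n) + sequence(n) - 1

# Repeat each chain's final time once per row in that chain
df$time[idx] <- rep(df$time[to], times = n)
```

For the example data this overwrites rows 3 to 6 with 1:25 and rows 9 to 11 with 2:40, which matches the output shown above.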

Loop over and check condition for consecutive records and replace in R
