Question

I have a data frame like this:

tdf <- structure(list(indx = c(1, 1, 1, 2, 2, 3, 3), group = c(1, 1, 
2, 1, 2, 1, 1)), .Names = c("indx", "group"), row.names = c(NA, 
-7L), class = "data.frame")

The data frame looks like this:

   indx group
1    1     1
2    1     1
3    1     2
4    2     1
5    2     2
6    3     1
7    3     1

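I want to iterate over the groups, keeping the group values of the first index unchanged in the desired output.

For each increment of indx after the first index, I want to add the maximum output value of the previous indx to group, so that the group values keep increasing from the second indx onward.

The desired output looks like this: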

    indx group    desiredOutput
1    1     1             1
2    1     1             1
3    1     2             2
4    2     1             3
5    2     2             4
6    3     1             5
7    3     1             5

For clarity, I have split the data frame up as follows:

    indx group    desiredOutput
1    1     1             1
2    1     1             1       To be retained as is
3    1     2             2


4    2     1             3       Second index-the max value of desiredOutput in indx1 is 2                   
5    2     2             4       I want to add this max value to the group value in indx 2       


6    3     1             5       Similarly, the max value of des.out of indx2 is 4
7    3     1             5       Adding the max value to group provides me new values

I tried splitting this data frame into a list of data frames and iterating over each one.

ndf <- split(tdf,f = tdf$indx)
x <- 0
for (i in seq_along(ndf)){
    ndf[[i]]$ng <- ndf[[i]]$group+x
    x <- max(ndf[[i]]$indx) + 1
}
ndf

The code above updates the values for the second index correctly, but fails when it reaches the third index.

Answer 1

First, find the maximum group value for each index, then compute the cumulative sum of those maxima.

library(dplyr)

maxGroupVals <- tdf %>% 
  group_by(indx) %>% 
  summarise(maxVal = max(group)) %>% 
  mutate(indx = indx + 1, maxVal = cumsum(maxVal))

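Add 1 to the index, because that is the index these maxima will be added to. Joining the data frames gives you the column of increments. After that, it is a simple mutate with a conditional to handle the indx = 1 case.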

tdf %>% 
  left_join(maxGroupVals, by = "indx") %>% 
  mutate(desiredOutput = if_else(indx == 1, group, group + maxVal)) %>% 
  select(-maxVal)

The final select() discards the intermediate column, if desired.
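The same cumulative-offset idea can be sketched in base R without a join. This is only an illustration under the assumption that the offsets should follow the sorted order of indx; the names idx, maxPerIndx and offsets are made up for this sketch and are not part of the answer above.

```r
tdf <- data.frame(indx  = c(1, 1, 1, 2, 2, 3, 3),
                  group = c(1, 1, 2, 1, 2, 1, 1))

# Treat indx as a factor so each row carries a position code (1, 2, 3, ...)
idx <- factor(tdf$indx)

# Largest group value within each indx, in the order of the factor levels
maxPerIndx <- as.vector(tapply(tdf$group, idx, max))

# Offset for each indx: running total of the maxima of all earlier indx values
# (0 for the first indx, whose group values are kept as they are)
offsets <- c(0, head(cumsum(maxPerIndx), -1))

# Add each row's offset to its group value
tdf$desiredOutput <- tdf$group + offsets[as.integer(idx)]
tdf
```

Because the first offset is 0, no special case is needed for indx = 1.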

Answer 2

dplyr version 1.0.1 has the function cur_group_id(), which does exactly what you need. In earlier versions of dplyr, group_indices() is the function you want:

library(dplyr)
tdf %>% group_by(indx, group) %>%
  mutate(desiredOutput = cur_group_id()) %>%
  ungroup()

Answer 3

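Consider pasting the two columns together, converting the result to a factor, and then converting that to an integer. The factor levels are set with unique so that they follow the order of appearance in the original data frame rather than alphabetical or numeric order.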

tdf <- within(tdf, {
    tmp <- paste(indx, group, sep="&")    
    new_indx <- as.integer(factor(tmp, levels=unique(tmp)))
    rm(tmp)    
})

tdf
#   indx group new_indx
# 1    1     1        1
# 2    1     1        1
# 3    1     2        2
# 4    2     1        3
# 5    2     2        4
# 6    3     1        5
# 7    3     1        5
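A small sketch of why levels = unique(tmp) matters, using a made-up, unsorted key vector (the values of tmp here are hypothetical and not taken from the question's data):

```r
# Hypothetical keys that are not in sorted order
tmp <- c("2&1", "2&1", "1&2", "1&1")

# Default factor(): levels are sorted, so the first key seen gets the highest code
sortedCodes <- as.integer(factor(tmp))
sortedCodes
# 3 3 2 1

# levels = unique(tmp): codes follow the order of first appearance
appearanceCodes <- as.integer(factor(tmp, levels = unique(tmp)))
appearanceCodes
# 1 1 2 3
```

On the question's data the two orderings happen to coincide, since the rows are already sorted by indx and group.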

Answer 4

To get a running count of the unique indx/group combinations, you can simply do the following (on pre-sorted data):

tdf$desiredOutput <- cumsum(!duplicated(tdf))

Which gives:

  indx group desiredOutput
1    1     1             1
2    1     1             1
3    1     2             2
4    2     1             3
5    2     2             4
6    3     1             5
7    3     1             5
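Note that duplicated(tdf) compares whole rows, so the one-liner relies on tdf holding only the key columns. A minimal sketch for a data frame with additional columns, assuming a hypothetical extra column named value:

```r
tdf <- data.frame(indx  = c(1, 1, 1, 2, 2, 3, 3),
                  group = c(1, 1, 2, 1, 2, 1, 1),
                  value = c(10, 20, 30, 40, 50, 60, 70))  # hypothetical extra column

# Restrict the duplicate check to the key columns only
keys <- tdf[c("indx", "group")]

# Each first occurrence of a key combination bumps the running count by one
tdf$desiredOutput <- cumsum(!duplicated(keys))
tdf
```

As with the original one-liner, this assumes that rows sharing the same indx/group combination are adjacent.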
