Question

I have data similar to this.

B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)), 
  Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester", "Expenses",  
  "Expenses", "In the Bimester", "In the Bimester"), 3), Value = runif(24))

You can see that Account has 4 occurrences of the element "In the Bimester": two "chunks" of two elements for each state, with "Expenses" rows in between them.

The order matters here, because the first chunk does not refer to the same thing as the second chunk.

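My data is actually more complex: it has a fourth variable that indicates what each row of Account means. The number of rows for each Account element (a factor per se) can vary. For example, in some state the first "chunk" of "In the Bimester" can have 6 rows and the second 7; however, I cannot tell the chunks apart using this fourth variable.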

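Desired: I would like to subset my data, separating the two "In the Bimester" chunks within each state and keeping only the rows of the first "chunk" for each state, or only those of the second "chunk".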

I have a solution using the data.table package, but I find it rather poor. Any thoughts?

Answer 1

You can use the dplyr package (with rleid from data.table as a helper):

library(dplyr)
B %>% mutate(helper = data.table::rleid(Account)) %>% 
      filter(Account == "In the Bimester") %>% 
      group_by(State) %>% filter(helper == min(helper)) %>% select(-helper)

# # A tibble: 6 x 3
# # Groups:   State [3]
#        State         Account      Value
#       <fctr>          <fctr>      <dbl>
# 1    Arizona In the Bimester 0.17730148
# 2    Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5      Texas In the Bimester 0.54076144
# 6      Texas In the Bimester 0.59168138

If you use max instead of min, you get the last occurrence of "In the Bimester" for each State. You can also exclude the Account column by changing the last pipe to select(-helper,-Account).
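
As a rough sketch of that variation (max in place of min, and Account dropped in the final select), assuming dplyr and data.table are installed. The set.seed call, stringsAsFactors = FALSE, the ungroup() step and the name last_chunk are illustrative additions, not part of the original answer:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
B <- data.frame(
  State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
  Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester",
                  "Expenses", "Expenses", "In the Bimester", "In the Bimester"), 3),
  Value = runif(24),
  stringsAsFactors = FALSE
)

last_chunk <- B %>%
  # rleid gives every run of identical consecutive Account values its own id
  mutate(helper = data.table::rleid(Account)) %>%
  filter(Account == "In the Bimester") %>%
  group_by(State) %>%
  # the largest run id within a state marks its last "In the Bimester" chunk
  filter(helper == max(helper)) %>%
  ungroup() %>%
  select(-helper, -Account)

last_chunk
```

Note that the run ids here are computed over the whole data frame, so this relies on one state not ending with the same Account value that the next state starts with. If that could happen in the real data, rleid(State, Account) may be the safer helper.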

P.S. If you don't want to use rleid from data.table and would rather use only dplyr functions, have a look at this thread.
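
One possible dplyr-only substitute for rleid, offered as a sketch and not necessarily the approach taken in the thread referred to above: count the positions where Account differs from the previous row, within each State. The names first_chunk and helper, the fixed seed, and stringsAsFactors = FALSE are assumptions made for illustration:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
B <- data.frame(
  State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
  Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester",
                  "Expenses", "Expenses", "In the Bimester", "In the Bimester"), 3),
  Value = runif(24),
  stringsAsFactors = FALSE
)

first_chunk <- B %>%
  group_by(State) %>%
  # helper increases by one each time Account changes from the previous row,
  # so every run of identical values within a state shares one id
  mutate(helper = cumsum(Account != lag(Account, default = first(Account)))) %>%
  filter(Account == "In the Bimester") %>%
  # the smallest remaining id within a state marks its first chunk
  filter(helper == min(helper)) %>%
  ungroup() %>%
  select(-helper)

first_chunk
```

Because the helper is built after group_by(State), runs cannot spill over from one state into the next.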

Subset rows for each group based on a character in a column and order of occurrence in a data frame
