Question

I have this data frame:

df <- data.frame(
  id = rep(1:4, each = 4), 
  status = c(
    NA, "a", "c", "a", 
    NA, "b", "c", "c",
    NA, NA, "a", "c",
    NA, NA, "b", "b"), 
  stringsAsFactors = FALSE)

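For each group (id), I want to remove the rows with one or more leading NAs (in the "status" column) that come before an "a", but not those that come before a "b".

The final data frame should look like this: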

structure(list(
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), 
  status = c("a", "c", "a", NA, "b", "c", "c", "a", "c", NA, NA, "b", "b")), 
  .Names = c("id", "status"), row.names = c(NA, -13L), class = "data.frame")

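How can I do that?

Edit: alternatively, how can I do this while keeping other variables in the data frame, such as otherVar in the following example: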

df2 <- data.frame(
   id = rep(1:4, each = 4), 
   status = c(
    NA, "a", "c", "a", 
    NA, "b", "c", "c",
    NA, NA, "a", "c",
    NA, NA, "b", "b"),
  otherVar = letters[1:16],
  stringsAsFactors = FALSE)

Answer 1

We can group by 'id', summarise 'status' by pasting its elements together (with toString), use gsub to remove the 'NA' entries that directly precede an 'a', and then convert the result back to 'long' format with separate_rows:

library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  summarise(status = gsub("(NA, ){1,}(?=a)", "", toString(status), perl = TRUE)) %>%
  separate_rows(status, convert = TRUE)
# A tibble: 13 x 2
#      id status
#   <int> <chr>
# 1     1 a
# 2     1 c
# 3     1 a
# 4     2 NA
# 5     2 b
# 6     2 c
# 7     2 c
# 8     3 a
# 9     3 c
#10     4 NA
#11     4 NA
#12     4 b
#13     4 b

Or use the same approach with data.table:

library(data.table)
# Note: setDT() converts df to a data.table by reference
out1 <- setDT(df)[, strsplit(gsub("(NA, ){1,}(?=a)", "", 
            toString(status), perl = TRUE), ", "), id]
setnames(out1, 'V1', "status")[]
# Note: the remaining NA entries are the string "NA" here, not missing values
#    id status
# 1:  1      a
# 2:  1      c
# 3:  1      a
# 4:  2     NA
# 5:  2      b
# 6:  2      c
# 7:  2      c
# 8:  3      a
# 9:  3      c
#10:  4     NA
#11:  4     NA
#12:  4      b
#13:  4      b

Update

For the updated dataset 'df2':
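
To keep additional columns such as otherVar in 'df2', one option is to filter rows instead of rebuilding the status column from a string. The following is a minimal base R sketch under that assumption, not the answerer's original code; the names keep and out2 are introduced here for illustration only.

```r
df2 <- data.frame(
  id = rep(1:4, each = 4),
  status = c(NA, "a", "c", "a",
             NA, "b", "c", "c",
             NA, NA, "a", "c",
             NA, NA, "b", "b"),
  otherVar = letters[1:16],
  stringsAsFactors = FALSE)

# Within each id, keep every non-NA row, and keep NA rows unless the first
# non-NA status of that id is "a". ave() hands each group's row positions
# to FUN and writes the results back in the original row order.
keep <- as.logical(ave(
  seq_along(df2$status), df2$id,
  FUN = function(i) {
    s <- df2$status[i]
    first <- s[!is.na(s)][1]          # first non-NA status in this group
    !is.na(s) | is.na(first) | first != "a"
  }))

# Row subsetting keeps all columns, including otherVar
out2 <- df2[keep, ]
out2
```

Because whole rows are selected, any number of extra columns is carried along unchanged.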

Answer 2

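Use na.locf from zoo together with is.na. Note that this assumes your data are ordered.

library(zoo)
# Carry the next non-NA value backwards, then drop the NA rows whose next value is "a"
df[!(na.locf(df$status, fromLast = TRUE) == 'a' & is.na(df$status)), ]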
   id status
2   1      a
3   1      c
4   1      a
5   2   <NA>
6   2      b
7   2      c
8   2      c
11  3      a
12  3      c
13  4   <NA>
14  4   <NA>
15  4      b
16  4      b
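
The one-liner above fills over the whole column, so an NA at the end of one group followed by an "a" at the start of the next group would be treated as preceding that "a". The sketch below illustrates the difference with a per-group fill. It uses a hypothetical base R helper, next_non_na, as a stand-in for the backward fill (it is not a zoo function), and a small made-up data frame named edge.

```r
# Hypothetical helper (not from zoo): replace each NA with the next non-NA
# value after it, leaving trailing NAs as NA.
next_non_na <- function(x) {
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Made-up edge case: group 1 ends with an NA and group 2 starts with "a"
edge <- data.frame(
  id = c(1, 1, 1, 2, 2),
  status = c("b", "c", NA, "a", "c"),
  stringsAsFactors = FALSE)

# Fill within each id only, so a value never leaks across a group boundary
filled <- ave(edge$status, edge$id, FUN = next_non_na)
drop_grouped <- is.na(edge$status) & !is.na(filled) & filled == "a"

# Fill over the whole column, for comparison
filled_all <- next_non_na(edge$status)
drop_ungrouped <- is.na(edge$status) & !is.na(filled_all) & filled_all == "a"

edge[!drop_grouped, ]    # all 5 rows are kept
edge[!drop_ungrouped, ]  # row 3 is dropped
```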

Answer 3

Here is a dplyr solution, followed by its base R translation:

dplyr

library(dplyr)
df %>% group_by(id) %>%
  filter(status[!is.na(status)][1]!="a" | !is.na(status))

# # A tibble: 13 x 2
# # Groups:   id [4]
#       id status
#    <int>  <chr>
#  1     1      a
#  2     1      c
#  3     1      a
#  4     2   <NA>
#  5     2      b
#  6     2      c
#  7     2      c
#  8     3      a
#  9     3      c
# 10     4   <NA>
# 11     4   <NA>
# 12     4      b
# 13     4      b

base

do.call(rbind,
        lapply(split(df,df$id),
               function(x) x[x$status[!is.na(x$status)][1]!="a" | !is.na(x$status),]))

#      id status
# 1.2   1      a
# 1.3   1      c
# 1.4   1      a
# 2.5   2   <NA>
# 2.6   2      b
# 2.7   2      c
# 2.8   2      c
# 3.11  3      a
# 3.12  3      c
# 4.13  4   <NA>
# 4.14  4   <NA>
# 4.15  4      b
# 4.16  4      b

Note

This fails if not all NAs are leading, because it removes every NA, not just the leading ones, from any group whose first non-NA value is "a".
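
If interior or trailing NAs can occur, the condition can be restricted to leading NAs only. The following is a minimal base R sketch of that idea; the helper keep_leading_only and the example data df3 are made up for illustration and are not part of the answer.

```r
# Hypothetical helper: TRUE for rows to keep within one group.
# Only NAs that come before the first non-NA value count as leading.
keep_leading_only <- function(s) {
  leading <- cumsum(!is.na(s)) == 0   # TRUE until the first non-NA value
  first <- s[!is.na(s)][1]            # first non-NA value (NA if none)
  !(leading & !is.na(first) & first == "a")
}

# Made-up data with NAs that are not leading
df3 <- data.frame(
  id = rep(1:2, each = 4),
  status = c(NA, "a", NA, "c",
             NA, "b", NA, "a"),
  stringsAsFactors = FALSE)

# Apply the helper per id; ave() returns the flags in original row order
keep3 <- as.logical(ave(
  seq_len(nrow(df3)), df3$id,
  FUN = function(i) keep_leading_only(df3$status[i])))

out3 <- df3[keep3, ]
out3
```

Here only the first row of id 1 is removed. The interior NA of id 1 and all NAs of id 2 are kept.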

Remove NAs preceding one specific string but keep those preceding another specific string, by group
