Question

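I'm not sure how to go about this, so in that sense my question is a bit broad. My real dataset contains data from more than 100 people, who had to fill in a questionnaire at 4 time points. Some of my data are missing, and what I want to know is: when data are missing, is the whole questionnaire missing for that time point? Or did the person simply 'fail' to answer one or a few questions at that time point? In the dataset below, Question runs from A to F (i.e. 6 questions).

Example code that I would like this to work on: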

ID <- rep(1:10, each = 24)
Question <- rep(LETTERS[1:6], 40)
Value <- round(runif(length(ID), 0, 5))
Time <- rep(c(0, 1, 3, 4), each = 6, times = 10)

df <- data.frame(ID, Question, Value, Time)
dfValue <- df[19:24, ]

df[19:24, ]$Value <- NA
df[28:30, ]$Value <- NA
df[49, ]$Value <- NA
df[55:61, ]$Value <- NA

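As you can see, I created NAs for some IDs: twice the whole questionnaire was not filled in, one person failed to answer 3 questions, and another person failed to answer 1 question.

What I have tried so far is: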

missing <- df[which(is.na(df$Value)), ]

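This works for the small dataset I provided (which does not have many NAs), but it becomes tedious with a large dataset. Is there a more convenient way to achieve the same thing? My own dataset returns a data frame with 569 observations, which is a bit much to go through by eye.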

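For clarity: I am looking for some algorithm/code that checks, for every ID (or every Value == NA), whether all other Values at the same Time are also NA or not. The idea is that it would return (based on the example data above):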

NA

Answer 1

You can use tapply to test all(is.na(x)) for each chunk x of df$Value, split by df$ID and df$Time:

tapply(df$Value, list(df$ID, df$Time), function(x) all(is.na(x)))

Edit (see comments)

tapply(df$Value, list(df$ID, df$Time), function(x) sum(is.na(x)) %in% 1:5)
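
The two tapply calls above can be combined into a single labelled matrix. The following is only a minimal sketch, assuming the example data from the question; the set.seed call, the helper names n_missing, n_asked and status, and the three labels are illustrative additions and not part of the original answer.

```r
# Rebuild the example data from the question.
# set.seed is an assumption added only to make the sketch reproducible.
set.seed(1)
ID <- rep(1:10, each = 24)
Question <- rep(LETTERS[1:6], 40)
Value <- round(runif(length(ID), 0, 5))
Time <- rep(c(0, 1, 3, 4), each = 6, times = 10)
df <- data.frame(ID, Question, Value, Time)
df$Value[c(19:24, 28:30, 49, 55:61)] <- NA

# Number of missing answers in each ID x Time cell
n_missing <- tapply(df$Value, list(ID = df$ID, Time = df$Time),
                    function(x) sum(is.na(x)))

# Number of questions asked in each cell (6 in the example)
n_asked <- tapply(df$Value, list(ID = df$ID, Time = df$Time), length)

# Label every cell; ifelse keeps the matrix shape and dimnames of its test
status <- ifelse(n_missing == 0, "complete",
                 ifelse(n_missing == n_asked, "all missing", "partly missing"))

# Show only the IDs that have at least one incomplete questionnaire
status[rowSums(status != "complete") > 0, , drop = FALSE]
```

Rows are IDs and columns are time points, so a whole questionnaire that was skipped and a questionnaire with a few gaps can be told apart at a glance. If some ID/Time combinations do not exist in the real data, tapply leaves those cells as NA, and they would need separate handling.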

Answer 2

You should use the powerful data.table package.

library(data.table)
setDT(df)

# Count the missing values within each Time/ID combination
df[, missing := sum(is.na(Value)), by = .(Time, ID)]

# This shows all cases where at least one value is missing
df[missing != 0]

# Next you can do simple subsetting to get answers, e.g.
# the cases where all 6 values are missing
df[missing == 6]

# The second part of your question can be solved by subsetting this data.
# For example:
df[ID == 1 & is.na(Value)]
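
With several hundred missing rows, one row per ID/Time combination may be easier to read than the row-level subsets. A minimal sketch of that idea follows, assuming the example data from the question; set.seed, the names dt and gaps, and the columns n_missing, which and status are hypothetical and chosen only for illustration.

```r
library(data.table)

# Rebuild the example data from the question.
# set.seed is an assumption added only to make the sketch reproducible.
set.seed(1)
dt <- data.table(
  ID       = rep(1:10, each = 24),
  Question = rep(LETTERS[1:6], 40),
  Time     = rep(c(0, 1, 3, 4), each = 6, times = 10)
)
dt[, Value := round(runif(.N, 0, 5))]
dt[c(19:24, 28:30, 49, 55:61), Value := NA_real_]

# Questions asked per ID/Time, counted before dropping anything
dt[, n_questions := .N, by = .(ID, Time)]

# One row per ID/Time that has gaps, listing which questions are missing
gaps <- dt[is.na(Value),
           .(n_missing = .N,
             which     = paste(Question, collapse = ","),
             status    = ifelse(.N == n_questions[1],
                                "all missing", "partly missing")),
           by = .(ID, Time)]
gaps
```

Combinations without any NA do not appear in gaps at all, which keeps the result short on a large dataset.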

Answer 3

I would use the dplyr library in the following way:

library(dplyr)

df_Sum <-
df %>% 
 # Creating Answered variable to detect if there is an NA in Value variable
 mutate(Answered = !is.na(Value)) %>% 
 # Group by ID and Time
 group_by(ID, Time) %>%
 # Sum the number of Answered for ID and Time 
 summarise(Num_Ans = sum(Answered))

This returns a table like the following:

    ID   Time   Num_Ans
   <int> <dbl>   <int>
 1     1     0       6
 2     1     1       6
 3     1     3       6
 4     1     4       0
 5     2     0       3
 6     2     1       6
 7     2     3       6
 8     2     4       6
 9     3     0       5
10     3     1       0
....

So you can filter for the cases where an ID has no answers at all at a given Time, i.e. Num_Ans == 0:

df_Sum %>% filter(Num_Ans == 0)

     ID  Time Num_Ans
  <int> <dbl>   <int>
1     1     4       0
2     3     1       0

You can also filter for the cases where only some of the questions were answered, i.e. Num_Ans < 6 and Num_Ans != 0:

df_Sum %>% 
 filter(Num_Ans < 6 & Num_Ans != 0)

    ID  Time Num_Ans
  <int> <dbl>   <int>
1     2     0       3
2     3     0       5
3     3     3       5
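
The two filters above can also be folded into a single status column. This is only a sketch under the assumption that the example data from the question is used; set.seed, the object status_tbl and the columns Num_Q and Status are illustrative names, not part of the original answer.

```r
library(dplyr)

# Rebuild the example data from the question.
# set.seed is an assumption added only to make the sketch reproducible.
set.seed(1)
ID <- rep(1:10, each = 24)
Question <- rep(LETTERS[1:6], 40)
Value <- round(runif(length(ID), 0, 5))
Time <- rep(c(0, 1, 3, 4), each = 6, times = 10)
df <- data.frame(ID, Question, Value, Time)
df$Value[c(19:24, 28:30, 49, 55:61)] <- NA

status_tbl <- df %>%
  group_by(ID, Time) %>%
  # Answered questions and total questions per ID/Time
  summarise(Num_Ans = sum(!is.na(Value)), Num_Q = n(), .groups = "drop") %>%
  # Classify each questionnaire in one pass
  mutate(Status = case_when(
    Num_Ans == 0    ~ "all missing",
    Num_Ans < Num_Q ~ "partly missing",
    TRUE            ~ "complete"
  ))

# How many questionnaires fall into each category
count(status_tbl, Status)

# Only the incomplete ones
filter(status_tbl, Status != "complete")
```

Comparing against Num_Q rather than a hard-coded 6 is a design choice that keeps the sketch usable if the number of questions differs between time points.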

Extracting NA data based on several conditions
