Keep unique entries by name and time

Date: 2017-06-29 06:00:21

Tags: r

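I'm struggling with what feels like a bit of code golf. I'm stuck with a complex dataset in long format that I need in wide format for analysis. The conversion itself was easy. However, because of the way the data were entered, the converted dataset contains redundant rows. Here is an MWE of the problem I'm facing: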

id <- c("ana","ana","ana", "brad","ana","brad","brad","brad", "matt", "matt", "matt")
hour <- c(0,    0,    24,    0,     48,    24,   NA,    72,    0 ,     24,     48 )
assessment <- c("memory", "memory", "attention",  "verbal",  "attention", "memory", "attention","attention", "memory", "attention", "attention")
value <- c(0.000,NA,0.895,0.000,15.000, 3, 5, NA,2,  4,5 )

mydata<-data.frame(id, hour, assessment, value)

Result:

> mydata
     id hour assessment  value
1   ana    0     memory  0.000
2   ana    0     memory     NA
3   ana   24  attention  0.895
4  brad    0     verbal  0.000
5   ana   48  attention 15.000
6  brad   24     memory  3.000
7  brad   NA  attention  5.000
8  brad   72  attention     NA
9  matt    0     memory  2.000
10 matt   24  attention  4.000
11 matt   48  attention  5.000

Then:

library(dplyr)
library(tidyr)
mydata %>% 
    group_by(id) %>%
    mutate(i1=row_number()) %>% 
    spread(assessment, value)

I arrive at:

Source: local data frame [11 x 6]
Groups: id [3]

       id  hour    i1 attention memory verbal
*  <fctr> <dbl> <int>     <dbl>  <dbl>  <dbl>
1     ana     0     1        NA      0     NA
2     ana     0     2        NA     NA     NA
3     ana    24     3     0.895     NA     NA
4     ana    48     4    15.000     NA     NA
5    brad     0     1        NA     NA      0
6    brad    24     2        NA      3     NA
7    brad    72     4        NA     NA     NA
8    brad    NA     3     5.000     NA     NA
9    matt     0     1        NA      2     NA
10   matt    24     2     4.000     NA     NA
11   matt    48     3     5.000     NA     NA

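Note that ana has two entries for hour 0 and memory, and brad has one entry at hour 0 and another with a missing hour. The missing hour should also be treated as zero; it is an input error by the person who collected the data.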
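
To make the redundancy concrete, a quick check along these lines lists the id/hour pairs that span more than one long-format row once a missing hour is treated as 0. This is only a sketch, assuming dplyr is available; `hour0` and `dupes` are names introduced here for illustration.

```r
library(dplyr)

# Same MWE data as above, rebuilt so the snippet runs on its own
mydata <- data.frame(
  id = c("ana", "ana", "ana", "brad", "ana", "brad", "brad", "brad",
         "matt", "matt", "matt"),
  hour = c(0, 0, 24, 0, 48, 24, NA, 72, 0, 24, 48),
  assessment = c("memory", "memory", "attention", "verbal", "attention",
                 "memory", "attention", "attention", "memory",
                 "attention", "attention"),
  value = c(0.000, NA, 0.895, 0.000, 15.000, 3, 5, NA, 2, 4, 5)
)

dupes <- mydata %>%
  mutate(hour0 = coalesce(hour, 0)) %>%  # hypothetical helper column: NA hour counted as 0
  count(id, hour0) %>%                   # number of long rows per id/hour pair
  filter(n > 1)                          # keep only pairs that occur more than once

dupes
```

With the MWE this should flag exactly the two cases described above: ana at hour 0 and brad at hour 0.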

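The table below shows how the entries for ana and brad should look. Duplicates with the same id and hour (including NA) should be collapsed/merged (see rows 1 and 5 below).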

       id  hour    i1 attention memory verbal
*  <fctr> <dbl> <int>     <dbl>  <dbl>  <dbl>
1     ana     0     1        NA      0     NA
2     ana    24     3     0.895     NA     NA
4     ana    48     4    15.000     NA     NA
5    brad     0     1     5.000     NA      0
6    brad    24     2        NA      3     NA
7    brad    72     4        NA     NA     NA
9    matt     0     1        NA      2     NA
10   matt    24     2     4.000     NA     NA
11   matt    48     3     5.000     NA     NA

Question:

  • How can I reduce the duplicates per subject + hour in a dataset like this, so that it looks like the last table?

1 answer:

Answer 0: (score: 1)

One option is to replace NA with 0, get the distinct rows, and then proceed as in the OP's code

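mydata %>%
    # treat a missing hour as 0 (note: missing values also become 0)
    mutate_at(vars(hour, value), funs(replace(., is.na(.), 0))) %>% 
    # sort by id and hour, larger values first
    arrange(id, hour, desc(value)) %>% 
    # drop rows that are now exact duplicates
    distinct() %>% 
    group_by(id, hour, assessment) %>%
    # long to wide: one column per assessment
    spread(assessment, value)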
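
As a possible variation on the same idea, the sketch below recodes only the missing hour as 0 and collapses duplicates by keeping the first non-missing value, so genuinely missing values stay NA in the wide table. It is not the answer's code: it assumes dplyr >= 1.0 and tidyr >= 1.0 (where `funs()` is deprecated and `spread()` is superseded by `pivot_wider()`), and `first_non_na` and `wide` are hypothetical names introduced here.

```r
library(dplyr)
library(tidyr)

# Same MWE data as in the question, rebuilt so the snippet runs on its own
mydata <- data.frame(
  id = c("ana", "ana", "ana", "brad", "ana", "brad", "brad", "brad",
         "matt", "matt", "matt"),
  hour = c(0, 0, 24, 0, 48, 24, NA, 72, 0, 24, 48),
  assessment = c("memory", "memory", "attention", "verbal", "attention",
                 "memory", "attention", "attention", "memory",
                 "attention", "attention"),
  value = c(0.000, NA, 0.895, 0.000, 15.000, 3, 5, NA, 2, 4, 5)
)

# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[1]
}

wide <- mydata %>%
  mutate(hour = coalesce(hour, 0)) %>%          # missing hour is an input error for 0
  group_by(id, hour, assessment) %>%
  summarise(value = first_non_na(value),        # collapse duplicates within a cell
            .groups = "drop") %>%
  pivot_wider(names_from = assessment,          # long to wide: one column per assessment
              values_from = value) %>%
  arrange(id, hour)

wide
```

Keeping the first non-missing value is one design choice among several; if duplicate entries could disagree, a different summary (or an explicit check for conflicts) may be more appropriate.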