Combining rows in R

Time: 2017-10-16 13:06:57

Tags: r

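I'm working in R with a long-format data frame and have run into a problem. My data frame is actually made up of two smaller data frames. I then rescaled the timeline from months to years so that both share a common timeline.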

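However, the problem I now face is that I sometimes have two rows with the same time value (one row per questionnaire), but I want only one row per time value. (I've attached a picture of the problem, which may be more illuminating than my explanation.) Note that at this point I still want the data frame in long format; I just want to get rid of the "extra rows".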

Can anyone tell me how to do this?

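I've also attached the dput() output for the head of the data, where nomem_encr = ID, timeline.compressed = time, sel01-03 = part of the first questionnaire, and close_num and gener_sat = part of the second questionnaire.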

`

structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L, 
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28, 
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA, 
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA, 
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L, 
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed", 
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 
6L))

`

https://i.stack.imgur.com/3p038.png

2 answers:

Answer 0 (score: 0)

Using the reshape2 and dplyr packages

Load the libraries and the data:

library(reshape2)
library(dplyr)

x <- structure(
  list(
    nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
    timeline.compressed = c(79, 79, 95, 79, 28,  28),
    sel01 = c(NA, 6L, NA, NA, NA, 7L),
    sel02 = c(NA, 6L, NA,  NA, NA, 7L),
    sel03 = c(NA, 3L, NA, NA, NA, 5L),
    sel04 = c(NA,  6L, NA, NA, NA, 6L),
    close_num = c(1, NA, 0.2, 1, 0.8, NA),
    gener_sat = c(7L,  NA, 7L, 8L, 7L, NA)
  ), 
  .Names = c(
    "nomem_encr", "timeline.compressed",
    "sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"
  ),
  class = "data.frame",
  row.names = c(NA, 6L)
)
x

Here is your data:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79    NA    NA    NA    NA       1.0         7
2     800009                  79     6     6     3     6        NA        NA
3     800009                  95    NA    NA    NA    NA       0.2         7
4     800012                  79    NA    NA    NA    NA       1.0         8
5     800015                  28    NA    NA    NA    NA       0.8         7
6     800015                  28     7     7     5     6        NA        NA

Now let's melt the data into long form:

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
head(15)

Output:

   nomem_encr timeline.compressed variable value
1      800009                  79    sel01    NA
2      800009                  79    sel01     6
3      800009                  95    sel01    NA
4      800012                  79    sel01    NA
5      800015                  28    sel01    NA
6      800015                  28    sel01     7
7      800009                  79    sel02    NA
8      800009                  79    sel02     6
9      800009                  95    sel02    NA
10     800012                  79    sel02    NA
11     800015                  28    sel02    NA
12     800015                  28    sel02     7
13     800009                  79    sel03    NA
14     800009                  79    sel03     3
15     800009                  95    sel03    NA

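If we cast the melted data frame back to wide form, the default behavior is to count the number of entries for each combination of ID variables: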

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
  dcast(
    formula = nomem_encr + timeline.compressed ~ variable
  )

Output:

Aggregation function missing: defaulting to length
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     2     2     2     2         2         2
2     800009                  95     1     1     1     1         1         1
3     800012                  79     1     1     1     1         1         1
4     800015                  28     2     2     2     2         2         2

The item identified by 800009 79 has 2 entries (using nomem_encr and timeline.compressed as the identifying variables).
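The same count can be checked without reshaping. The following is a minimal base R sketch, not part of the original answer, that lists the key combinations occurring more than once; the names `keys`, `counts`, and `dupes` are illustrative only.

```r
# Only the two identifying columns from the question are needed here
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28)
)

keys <- x[c("nomem_encr", "timeline.compressed")]

# Count rows per (ID, time) pair by summing a column of ones within each group
counts <- aggregate(data.frame(n = rep(1L, nrow(keys))), by = keys, FUN = sum)

# Keep only the pairs that appear more than once
dupes <- counts[counts$n > 1, ]
dupes
```

If `dupes` comes back empty, every (ID, time) pair is already unique and no combining is needed.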

We can change the default behavior to something else, such as sum:

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
  dcast(
    formula = nomem_encr + timeline.compressed ~ variable,
    fun.aggregate = function(xs) sum(xs, na.rm = TRUE)
  )

Output:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7
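One caveat with sum: groups that contain only NA come out as 0, which cannot be told apart from a genuine zero, and a group with two non-missing values would have them added together. A possible variation, sketched here under the assumption that each group holds at most one non-missing value per column, is to pass an aggregation function that returns the first non-missing value. `first_non_na` is a hypothetical helper, not a reshape2 function.

```r
library(reshape2)

x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: first non-missing value, or NA when there is none.
# It also returns NA for zero-length input.
first_non_na <- function(xs) xs[!is.na(xs)][1]

long <- melt(data = x, id.vars = c("nomem_encr", "timeline.compressed"))

wide <- dcast(
  data = long,
  formula = nomem_encr + timeline.compressed ~ variable,
  value.var = "value",
  fun.aggregate = first_non_na
)
wide
```

With this aggregator the rows for 800009 at time 95 and 800012 at time 79 keep NA in the sel columns instead of 0.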

Answer 1 (score: 0)

You can do this with dplyr + tidyr:

library(dplyr)
library(tidyr)

df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarize_all(funs(sort(.)[1]))

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [?]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <int> <int> <int> <int>     <dbl>     <int>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95    NA    NA    NA    NA       0.2         7
3     800012                  79    NA    NA    NA    NA       1.0         8
4     800015                  28     7     7     5     6       0.8         7
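`funs()` was deprecated in dplyr 0.8.0, and `summarize_all()` has been superseded by `across()` since dplyr 1.0.0, so the call above may warn or fail on a current installation. A rough equivalent for recent versions is sketched below, assuming dplyr 1.0.0 or later. `first_non_na` is a hypothetical helper that takes the first non-missing value in row order, whereas `sort(.)[1]` takes the smallest one; the two agree here because each group has at most one non-missing value per column.

```r
library(dplyr)

df <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: first non-missing value, or NA when there is none
first_non_na <- function(v) v[!is.na(v)][1]

combined <- df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  # across(everything()) skips the grouping columns inside a grouped summarise
  summarise(across(everything(), first_non_na), .groups = "drop")

combined
```

`.groups = "drop"` returns an ungrouped tibble, which avoids carrying the leftover nomem_encr grouping into later steps.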

If you want to replace the NAs with zeros, you can do the following:

df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarize_all(funs(sort(.)[1])) %>%
  mutate_all(funs(replace(., is.na(.), 0)))

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [3]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7
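For an already combined data frame, the NA replacement can also be done in base R by indexing with the logical matrix that `is.na()` returns. The sketch below uses a trimmed copy of the combined result (only three of the value columns, typed in by hand for illustration), and `filled` is an illustrative name.

```r
# Trimmed, hand-typed copy of the combined result, for illustration only
combined <- data.frame(
  nomem_encr = c(800009L, 800009L, 800012L, 800015L),
  timeline.compressed = c(79, 95, 79, 28),
  sel01 = c(6L, NA, NA, 7L),
  close_num = c(1, 0.2, 1, 0.8),
  gener_sat = c(7L, 7L, 8L, 7L)
)

filled <- combined
# is.na() on a data frame gives a logical matrix; assigning through it
# overwrites every missing cell with 0
filled[is.na(filled)] <- 0
filled
```

Whether 0 is an appropriate stand-in depends on the analysis: a questionnaire item that was never answered is not the same as an answer of zero.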

Data:

df = structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L, 
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28, 
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA, 
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA, 
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L, 
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed", 
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 
6L))