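I am working with a long-format data frame in R, but I have run into a problem. My data frame is actually made up of two smaller data frames. I then rescaled the timeline from months to years so that both share a common timeline.
However, the problem I now face is that I sometimes have two rows with the same time value (one row per questionnaire), whereas I want only one row per time value. (I have attached a picture of the problem, which may be more insightful than my explanation.) Note that at this point I still want the data frame to be in long format; I just want to get rid of the "extra rows".
Can anyone tell me how to do this?
I have also attached the head of the data, where nomem_encr = ID, timeline.compressed = time, sel01-03 = part of the first questionnaire, and close_num and gener_sat = part of the second questionnaire.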
`
structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L,
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28,
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA,
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA,
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L,
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed",
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA,
6L))
`
Answer 0: (score: 0)
Load the libraries and the data:
library(reshape2)
library(dplyr)
x <- structure(
  list(
    nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
    timeline.compressed = c(79, 79, 95, 79, 28, 28),
    sel01 = c(NA, 6L, NA, NA, NA, 7L),
    sel02 = c(NA, 6L, NA, NA, NA, 7L),
    sel03 = c(NA, 3L, NA, NA, NA, 5L),
    sel04 = c(NA, 6L, NA, NA, NA, 6L),
    close_num = c(1, NA, 0.2, 1, 0.8, NA),
    gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
  ),
  .Names = c(
    "nomem_encr", "timeline.compressed",
    "sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"
  ),
  class = "data.frame",
  row.names = c(NA, 6L)
)
x
Here is your data:
nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1 800009 79 NA NA NA NA 1.0 7
2 800009 79 6 6 3 6 NA NA
3 800009 95 NA NA NA NA 0.2 7
4 800012 79 NA NA NA NA 1.0 8
5 800015 28 NA NA NA NA 0.8 7
6 800015 28 7 7 5 6 NA NA
Now let's melt the data into long form:
melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
  head(15)
Output:
nomem_encr timeline.compressed variable value
1 800009 79 sel01 NA
2 800009 79 sel01 6
3 800009 95 sel01 NA
4 800012 79 sel01 NA
5 800015 28 sel01 NA
6 800015 28 sel01 7
7 800009 79 sel02 NA
8 800009 79 sel02 6
9 800009 95 sel02 NA
10 800012 79 sel02 NA
11 800015 28 sel02 NA
12 800015 28 sel02 7
13 800009 79 sel03 NA
14 800009 79 sel03 3
15 800009 95 sel03 NA
If we cast the melted data frame back to wide form, the default behaviour is to count the number of entries for each item:
melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
  dcast(
    formula = nomem_encr + timeline.compressed ~ variable
  )
Output:
Aggregation function missing: defaulting to length
nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1 800009 79 2 2 2 2 2 2
2 800009 95 1 1 1 1 1 1
3 800012 79 1 1 1 1 1 1
4 800015 28 2 2 2 2 2 2
The item identified by 800009 79 has 2 entries (using nomem_encr and timeline.compressed as the identifying variables).
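To see which rows share an ID/time key before aggregating anything, a base R check along these lines may help. This is only a sketch: the names keys and dup are made up for illustration, and x is rebuilt here with data.frame() so the snippet runs on its own.

```r
# Rebuild the example data so the snippet is self-contained
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# The two identifying columns (hypothetical helper name: keys)
keys <- x[, c("nomem_encr", "timeline.compressed")]

# duplicated() only flags the second and later occurrences, so combine it
# with fromLast = TRUE to flag every row whose key occurs more than once
dup <- x[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]
dup
```

Rows 1, 2, 5 and 6 come back, which matches the two items counted as 2 in the dcast output.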
We can change the default behaviour to something else, e.g. sum:
melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>%
  dcast(
    formula = nomem_encr + timeline.compressed ~ variable,
    fun.aggregate = function(xs) sum(xs, na.rm = TRUE)
  )
Output:
nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1 800009 79 6 6 3 6 1.0 7
2 800009 95 0 0 0 0 0.2 7
3 800012 79 0 0 0 0 1.0 8
4 800015 28 7 7 5 6 0.8 7
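One thing to be aware of with sum(xs, na.rm = TRUE) is that a cell with no answers at all becomes 0, which is indistinguishable from a real 0. If you would rather keep those cells as NA, an aggregation function that returns the first non-missing value might be an option. The sketch below assumes there is at most one non-NA value per ID/time/variable; first_non_na and wide are names made up for illustration.

```r
library(reshape2)

# Rebuild the example data so the snippet is self-contained
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: first non-missing value, or NA if every value is missing
# (also returns NA for an empty vector)
first_non_na <- function(xs) xs[!is.na(xs)][1]

long <- melt(data = x, id.vars = c("nomem_encr", "timeline.compressed"))

# Cast back to one row per ID/time, keeping NA where nothing was recorded
wide <- dcast(
  data = long,
  formula = nomem_encr + timeline.compressed ~ variable,
  value.var = "value",
  fun.aggregate = first_non_na
)
wide
```

With this, the sel columns for 800009 at time 95 and for 800012 stay NA instead of turning into 0.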
Answer 1: (score: 0)
You can use dplyr + tidyr:
library(dplyr)
library(tidyr)
df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarize_all(funs(sort(.)[1]))
Result:
# A tibble: 4 x 8
# Groups: nomem_encr [?]
nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
<int> <dbl> <int> <int> <int> <int> <dbl> <int>
1 800009 79 6 6 3 6 1.0 7
2 800009 95 NA NA NA NA 0.2 7
3 800012 79 NA NA NA NA 1.0 8
4 800015 28 7 7 5 6 0.8 7
If you want to replace the NAs with zeros, you can do the following:
df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarize_all(funs(sort(.)[1])) %>%
  mutate_all(funs(replace(., is.na(.), 0)))
Result:
# A tibble: 4 x 8
# Groups: nomem_encr [3]
nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 800009 79 6 6 3 6 1.0 7
2 800009 95 0 0 0 0 0.2 7
3 800012 79 0 0 0 0 1.0 8
4 800015 28 7 7 5 6 0.8 7
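Note that funs() has been deprecated since dplyr 0.8.0, so on a recent dplyr the code above may warn. A rough equivalent using across() (available from dplyr 1.0.0) could look like the sketch below. It assumes at most one non-NA value per group and column, in which case taking the first non-NA value gives the same answer as sort(.)[1]; first_non_na, result and result0 are names made up for illustration.

```r
library(dplyr)

# Rebuild the example data so the snippet is self-contained
df <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(v) v[!is.na(v)][1]

# One row per ID/time; grouping columns are left out of everything()
result <- df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

# Optional: replace the remaining NAs in the questionnaire columns with 0
result0 <- result %>%
  mutate(across(sel01:gener_sat, ~ replace(.x, is.na(.x), 0)))

result
result0
```

As in the original version, replacing NA with 0 turns the integer columns into doubles, because 0 is a double in R.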
Data:
df = structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L,
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28,
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA,
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA,
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L,
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed",
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA,
6L))