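I have observed subjects a - d over 2 - 4 years, with one value reported per year. I want to extract the first and last value for each subject, ignoring NAs. How can I create the new variables first_value and last_value? The desired result is included in this example: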
df <- data.frame(subject = c("a","b","c","d"),
                 year1 = c(1, 2, NA, NA),
                 year2 = c(3, 4, NA, 5),
                 year3 = c(6, 7, 8, NA),
                 year4 = c(9, 10, NA, 11),
                 first_value = c(1, 2, 8, 5),
                 last_value = c(9, 10, 8, 11))
What would the solution be if the variables year1 - year4 were categorical?
Answer 0: (score: 3)
Using the data.table package:
library(data.table)
setDT(df)[, `:=` (first_value = na.omit(unlist(.SD))[1],
                  last_value = tail(na.omit(unlist(.SD)), 1)),
          by = subject][]
gives:
   subject year1 year2 year3 year4 first_value last_value
1:       a     1     3     6     9           1          9
2:       b     2     4     7    10           2         10
3:       c    NA    NA     8    NA           8          8
4:       d    NA     5    NA    11           5         11
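The same row-wise idea (drop the NAs, then take the first and last remaining element) can also be sketched in base R without data.table. This is only an illustration: the helpers `first_non_na` and `last_non_na` and the vector `year_cols` are hypothetical names that do not appear in the original answer.

```r
# Example data from the question (without the desired-result columns).
df <- data.frame(subject = c("a", "b", "c", "d"),
                 year1 = c(1, 2, NA, NA),
                 year2 = c(3, 4, NA, 5),
                 year3 = c(6, 7, 8, NA),
                 year4 = c(9, 10, NA, 11))

# Hypothetical name: the columns that hold the yearly values.
year_cols <- paste0("year", 1:4)

# Hypothetical helpers: first / last non-NA element of a vector,
# or NA when every element is missing.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[length(x)] else NA
}

# apply() with MARGIN = 1 calls the helper once per row.
df$first_value <- apply(df[year_cols], 1, first_non_na)
df$last_value  <- apply(df[year_cols], 1, last_non_na)

print(df)
```

Restricting the computation to `year_cols` keeps any other columns in `df` from being picked up by accident.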
Following a suggestion by @alexis_laz, you can use max.col as follows to get the relevant values:
f <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'first')
l <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'last')
df$first_value <- sapply(seq_along(f), function(i) df[,-1][i,f[i]])
df$last_value <- sapply(seq_along(l), function(i) df[,-1][i,l[i]])
This gives the same result. As @alexis_laz suggested in the comments, this can be improved further:
m <- as.matrix(df[c("year1", "year2", "year3", "year4")])
f <- max.col(!is.na(m), 'first')
l <- max.col(!is.na(m), 'last')
# index the matrix directly with (row, column) pairs
df$first_value <- m[cbind(seq_len(nrow(m)), f)]
df$last_value <- m[cbind(seq_len(nrow(m)), l)]
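Because `max.col` only looks at the logical matrix of non-missing positions, the same indexing should carry over to categorical year columns, which the question also asks about. The following is a minimal sketch under that assumption: the function `first_last_by_row` and the `grades` data are hypothetical and not part of the original answer.

```r
# Hypothetical helper: first and last non-NA value per row of the given columns.
first_last_by_row <- function(data, cols) {
  m <- as.matrix(data[cols])          # character matrix for categorical columns
  present <- !is.na(m)                # TRUE where a value was reported
  rows <- seq_len(nrow(m))
  f <- max.col(present, ties.method = "first")  # column of first TRUE per row
  l <- max.col(present, ties.method = "last")   # column of last TRUE per row
  data.frame(first_value = m[cbind(rows, f)],
             last_value  = m[cbind(rows, l)],
             stringsAsFactors = FALSE)
}

# Hypothetical categorical data, made up for illustration.
grades <- data.frame(subject = c("a", "b", "c"),
                     year1 = c("low", NA, NA),
                     year2 = c("mid", NA, "high"),
                     year3 = c(NA, "low", "mid"),
                     stringsAsFactors = FALSE)

res <- first_last_by_row(grades, c("year1", "year2", "year3"))
print(cbind(grades, res))
```

For a row in which every value is missing, all positions tie, so the selected cell is itself NA and both results come out as NA.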
And using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
df %>%
  gather(year, val, 2:5) %>%
  filter(!is.na(val)) %>%
  group_by(subject) %>%
  summarise(first_value = first(val),
            last_value = last(val)) %>%
  left_join(df, ., by = 'subject')
Caveat: the variant that does not use filter and instead uses na.omit(val) (or val[!is.na(val)]) inside summarise:
df %>%
  gather(year, val, 2:5) %>%
  group_by(subject) %>%
  summarise(first_value = first(na.omit(val)),
            last_value = last(na.omit(val))) %>%
  left_join(df, ., by = 'subject')
does not work, because of the bugs reported here and here.
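For readers who prefer to avoid the extra packages, the reshape, filter and summarise steps can be approximated in base R. This is a rough sketch rather than a drop-in replacement for the pipeline above; the names `year_cols`, `long` and `result` are assumptions introduced only for this example.

```r
# Example data from the question (without the desired-result columns).
df <- data.frame(subject = c("a", "b", "c", "d"),
                 year1 = c(1, 2, NA, NA),
                 year2 = c(3, 4, NA, 5),
                 year3 = c(6, 7, 8, NA),
                 year4 = c(9, 10, NA, 11))

year_cols <- paste0("year", 1:4)

# Long format: one row per subject-year pair, with the years stacked in order.
long <- data.frame(subject = rep(as.character(df$subject), times = length(year_cols)),
                   year = rep(year_cols, each = nrow(df)),
                   val = unlist(df[year_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop missing observations before summarising.
long <- long[!is.na(long$val), ]

# Per subject: first and last remaining value, in year order.
first_value <- tapply(long$val, long$subject, function(v) v[1])
last_value  <- tapply(long$val, long$subject, function(v) v[length(v)])

# Attach the summaries to the wide data by subject name.
result <- df
result$first_value <- as.numeric(first_value[as.character(df$subject)])
result$last_value  <- as.numeric(last_value[as.character(df$subject)])

print(result)
```

A subject whose values are all missing drops out of `long`, so the lookup by name returns NA for that subject.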
Answer 1: (score: 0)
Using data.frame and gather
#Used packages
library(tidyr)
library(dplyr)
subject<-c("a","b","c","d")
year1 <- c(1, 2, NA, NA)
year2 <- c(3, 4, NA, 5)
year3 <- c(6, 7, 8, NA)
year4 <- c(9, 10, NA, 11)
dt = data.frame(subject, year1, year2, year3, year4)
gather(): collapses multiple columns into a single column
dt <- dt %>% gather(year, value, year1:year4)
summarise(): computes summary statistics for the selected variables
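# Note: min()/max() match the first/last value here only because
# each subject's values increase from year1 to year4.
dt %>% group_by(subject) %>%
  summarise(first_value = min(value, na.rm=TRUE),
            last_value = max(value, na.rm=TRUE))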
Output:
# A tibble: 4 × 3
  subject first_value last_value
   <fctr>       <dbl>      <dbl>
1       a           1          9
2       b           2         10
3       c           8          8
4       d           5         11
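Note that min() and max() return the smallest and largest values. They coincide with the first and last observations in this example only because each subject's values increase from year to year. A minimal sketch with hypothetical, non-monotonic values (not taken from the question) shows the difference:

```r
# Hypothetical yearly values for one subject, not monotonic.
values <- c(9, NA, 3, 6)

# Position-based: first and last non-missing observation.
observed <- values[!is.na(values)]
first_by_position <- observed[1]
last_by_position  <- observed[length(observed)]

# Magnitude-based: what min()/max() compute.
smallest <- min(values, na.rm = TRUE)
largest  <- max(values, na.rm = TRUE)

print(c(first = first_by_position, last = last_by_position))
print(c(min = smallest, max = largest))
```

When the order of observations matters, a position-based approach such as the filter plus first()/last() pipeline in the other answer is the safer choice.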