I'm looking for a pipe-friendly solution to the following problem.
My data looks like this:
tibble(
column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)
# A tibble: 3 × 9
column_set_1_1 column_set_1_2 column_set_1_3 column_set_2_1 column_set_2_2 column_set_2_3 column_set_2_4 column_set_2_5 column_set_2_6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 1 2 3 1 4 7 10 13 NA
2 2 3 NA 2 5 8 11 NA NA
3 3 NA NA 3 6 9 NA NA NA
I basically want to get the last non-NA value within each column set. So the expected output is:
tibble(
column_set_1 = c(3, 3, 3),
column_set_2 = c(13, 11, 9)
)
# A tibble: 3 × 2
column_set_1 column_set_2
<dbl> <dbl>
1 3 13
2 3 11
3 3 9
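Answer 0 (score: 7)
Here is a tidyverse approach that does not reshape the original data frame. Instead, it splits the data frame into groups by column name pattern and uses the coalesce function to get the last non-NA value in each sub data frame: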
library(tidyverse)
df_foo %>%
mutate_all(as.numeric) %>%
split.default(f = sub("_\\d+$", "", names(.))) %>%
map_df(~do.call(coalesce, setNames(rev(.), NULL)))
# A tibble: 3 × 2
# column_set_1 column_set_2
# <dbl> <dbl>
#1 3 13
#2 3 11
#3 3 9
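The two building blocks of this answer can be tried in isolation. The following is a minimal sketch, not part of the original answer: the list names a, b and c are hypothetical, and the column names are copied from the question.

```r
library(dplyr)

# A few column names from the question's data
col_names <- c("column_set_1_1", "column_set_1_2", "column_set_1_3",
               "column_set_2_1", "column_set_2_2")

# Dropping the trailing "_<digits>" gives the grouping key passed to split.default()
set_key <- sub("_\\d+$", "", col_names)
print(set_key)

# The three columns of column_set_1, in left-to-right order (names are hypothetical)
set_1 <- list(a = c(1, 2, 3), b = c(2, 3, NA), c = c(3, NA, NA))

# coalesce() keeps the first non-NA value at each position, so feeding it
# the columns in reverse order returns the last non-NA value per row
last_non_na <- do.call(coalesce, unname(rev(set_1)))
print(last_non_na)
```

Reversing the column order is what turns "first non-NA" into "last non-NA"; the names are dropped so that the list elements are passed to coalesce as positional arguments.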
Answer 1 (score: 1)
Here is a solution using tidyverse tools:
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
get_last_nonNA <- function(vec) {
return(last(vec[!is.na(vec)]))
}
convert_table_last_nonNA <- . %>%
rownames_to_column() %>%
gather(key=column_type, value=value, -rowname) %>%
mutate(column_set=str_extract(string=column_type,
pattern="[0-9]+")) %>%
group_by(column_set, rowname) %>%
summarise(last_nonNA_value=get_last_nonNA(value)) %>%
spread(key=column_set, value=last_nonNA_value) %>%
select(-rowname) %>%
select(colnames(.) %>% as.integer() %>% order()) %>%
"colnames<-"(paste0("column_set_", colnames(.)))
# Usage
data_tbl <- tibble(
column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA),
column_set_1_3 = c(3, NA, NA), column_set_2_1 = c(1, 2, 3),
column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA),
column_set_2_6 = c(NA, NA, NA)
)
convert_table_last_nonNA(data_tbl)
# # A tibble: 3 × 2
# column_set_1 column_set_2
# * <dbl> <dbl>
# 1 3 13
# 2 3 11
# 3 3 9
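What it does, step by step:
1. convert_table_last_nonNA <- . %>% starts the function definition as a pipe sequence;
2. rownames_to_column() adds the row names as a separate column, to keep track of which row each last non-NA value is extracted for;
3. gather(key=column_type, value=value, -rowname) converts the input table to long format: rows now represent combinations of the key columns (rowname and column_type) and the value (value);
4. The column set identifier is extracted (as the first number in the column_type string) and stored in a separate column, column_set. This is done via mutate(column_set=str_extract(string=column_type, pattern="[0-9]+"));
5. group_by(column_set, rowname) %>% summarise(last_nonNA_value=get_last_nonNA(value)) summarises the data in the desired way. It reads: "for every combination of column_set and rowname, give the last non-NA value of value (via a get_last_nonNA call) and store it in the last_nonNA_value column". Note: if some combination of column_set and rowname has only NAs, the result will be NA;
6. spread(key=column_set, value=last_nonNA_value) converts the table to wide format. Now there is a column for every item in column_set, whose values are the last_nonNA_values;
7. select(-rowname) removes the rowname column, because it is no longer needed;
8. The columns are ordered by column set number (otherwise column_set_10 would be placed directly after column_set_1). This is done via select(colnames(.) %>% as.integer() %>% order());
9. "colnames<-"(paste0("column_set_", colnames(.))) adds the column_set_ prefix to the column names.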
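The note about combinations that contain only NAs can be checked directly on the helper. This is a small sketch that reuses get_last_nonNA exactly as defined in the answer; the input vectors are taken from the question's data.

```r
library(dplyr)

# Same helper as in the answer: drop NAs, then take the last remaining element
get_last_nonNA <- function(vec) {
  last(vec[!is.na(vec)])
}

# A row with trailing NAs: the last non-NA value is returned
print(get_last_nonNA(c(10, 11, NA)))

# A row with only NAs: nothing is left after filtering, so last() falls back to NA
print(get_last_nonNA(c(NA, NA, NA)))
```

The NA fallback comes from dplyr's last(), which returns a missing value when the requested position does not exist.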
Answer 2 (score: 0)
Here is a solution I came up with that works with pipes:
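library(dplyr)
library(tidyr)
library(stringr)
df_foo %>%
# Assumption: the sample data has no ID column, so one is created from the row number
mutate(ID = row_number()) %>%
gather(key = Key, value = Value, -ID) %>%
# "[0-9]+" also matches set numbers with more than one digit
mutate(set = str_extract(Key, "column_set_[0-9]+")) %>%
# Convert to integer so that 10 sorts after 9 rather than before 2
mutate(number = as.integer(str_extract(Key, "[0-9]+$"))) %>%
group_by(ID, set) %>%
dplyr::filter(!is.na(Value)) %>%
arrange(number) %>%
slice(n()) %>%
select(-number, -Key) %>%
spread(key = set, value = Value)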
I don't like the fact that I have to arrange and then slice to get the last row; it seems inelegant to me. Any improvements are welcome.
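One possible way to avoid the arrange-then-slice step is sketched below. It is not from the original thread and rests on a few assumptions: tidyr 1.1.0 or later (for names_transform), dplyr 1.0.0 or later (for slice_max), and an ID column that is created here from the row number because the sample data has none.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df_foo <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)

result <- df_foo %>%
  mutate(ID = row_number()) %>%               # hypothetical row identifier
  pivot_longer(
    -ID,
    names_to = c("set", "number"),            # split each name into set and position
    names_pattern = "(column_set_[0-9]+)_([0-9]+)",
    names_transform = list(number = as.integer)
  ) %>%
  filter(!is.na(value)) %>%
  group_by(ID, set) %>%
  slice_max(number, n = 1) %>%                # keep the right-most non-NA column per set
  ungroup() %>%
  select(-number) %>%
  pivot_wider(names_from = set, values_from = value)

print(result)
```

Here slice_max picks the row with the highest column position in each group directly, so no explicit sorting is needed. A row whose set contains only NAs would be dropped by the filter and would show up as NA after pivot_wider.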