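This is probably a very simple question, but I searched for it and could not find a solution.
I have a wide dataset with 65 columns and 3.5 million rows. The data looks like this: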
GR SR var1 var2 var3 var4 var5 var6
1 2 "" "" "" "" "" x
1 2 x x x "" "" ""
1 2 "" "" "" "" "" ""
1 3 x x x x "" ""
1 3 "" "" "" "" "" ""
"" = NULL
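I want to update var1 through var6 based on the other rows in the same group. That is, for each combination of GR and SR, if a column among var1 to var6 contains x in any row of the group, every row of that group should get x in that column. This would give the following table: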
GR SR var1 var2 var3 var4 var5 var6
1 2 x x x "" "" x
1 2 x x x "" "" x
1 2 x x x "" "" x
1 3 x x x x "" ""
1 3 x x x x "" ""
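Once the records are updated I want to remove the duplicates, but I already know how to do that with unique from library(data.table).
Does anyone know how to do the update?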
Answer 0 (score: 1)
This is very easy to do with data.table syntax:
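library(data.table)
setDT(my_data)  # convert to a data.table by reference
cols = paste0('var', 1:6)
# .SDcols limits .SD to var1..var6 (the real table has 65 columns);
# na.rm = TRUE keeps any() from returning NA if the empty cells are NA
my_data[ , by = .(GR, SR), .SDcols = cols,
  (cols) := lapply(.SD, function(x) if (any(x == 'x', na.rm = TRUE)) 'x' else '')]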
If I'm not mistaken, you can simply drop the (cols) := part to do both steps at once (i.e. also get the unique rows):
# without :=, the result has one row per (GR, SR) group
my_data[ , by = .(GR, SR), .SDcols = cols,
  lapply(.SD, function(x) if (any(x == 'x', na.rm = TRUE)) 'x' else '')]
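To check that the two routes above agree, here is a minimal sketch on the five sample rows from the question. It assumes the empty cells are the string "" (the question does not say whether they are "" or NA), and the helper name fill_x and the objects updated, two_step and one_step are made up for this illustration.

```r
library(data.table)

# Sample rows from the question; "" stands in for the empty cells (assumption)
my_data <- data.table(
  GR   = c(1L, 1L, 1L, 1L, 1L),
  SR   = c(2L, 2L, 2L, 3L, 3L),
  var1 = c("", "x", "", "x", ""),
  var2 = c("", "x", "", "x", ""),
  var3 = c("", "x", "", "x", ""),
  var4 = c("", "", "", "x", ""),
  var5 = c("", "", "", "", ""),
  var6 = c("x", "", "", "", "")
)
cols <- paste0("var", 1:6)

# Hypothetical helper: "x" if any value in the group is "x", otherwise ""
fill_x <- function(v) if (any(v == "x", na.rm = TRUE)) "x" else ""

# Two-step route: update a copy in place, then drop duplicate rows
updated <- copy(my_data)
updated[, (cols) := lapply(.SD, fill_x), by = .(GR, SR), .SDcols = cols]
two_step <- unique(updated)

# One-step route: aggregate straight to one row per group
one_step <- my_data[, lapply(.SD, fill_x), by = .(GR, SR), .SDcols = cols]

print(two_step)
print(one_step)
```

Both tables should contain one row per (GR, SR) pair. The one-step form skips the intermediate table with 3.5 million rows, which may matter at that size, but it only returns the grouping columns and the columns named in .SDcols.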
Answer 1 (score: 1)
Here is a solution using fill() from tidyr (load the tidyverse first):
library(tidyverse)  # loads dplyr and tidyr
df %>% group_by(GR, SR) %>%
  # within each group, carry non-NA values down, then up
  fill(starts_with("var")) %>%
  fill(starts_with("var"), .direction = "up")
# GR SR var1 var2 var3 var4 var5 var6
# <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 x x x NA NA x
# 2 1 2 x x x NA NA x
# 3 1 2 x x x NA NA x
# 4 1 3 x x x x NA NA
# 5 1 3 x x x x NA NA
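I assumed the empty elements are NA. If they are the string "", you need to convert them to NA first; otherwise the code above will not work, because fill() only fills NA values.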
# How to recode all "" to NA?
# Insert the following line between group_by() and fill()
# (across() needs dplyr >= 1.0.0; funs() is deprecated since dplyr 0.8.0)
mutate(across(starts_with("var"), ~ na_if(.x, ""))) %>%
# data
df <- structure(list(GR = c(1L, 1L, 1L, 1L, 1L),
SR = c(2L, 2L, 2L, 3L, 3L), var1 = c(NA, "x", NA, "x", NA),
var2 = c(NA, "x", NA, "x", NA), var3 = c(NA, "x", NA, "x", NA),
var4 = c(NA, NA, NA, "x", NA), var5 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), var6 = c("x", NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -5L))
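Putting the pieces of this answer together, the following is a rough end-to-end sketch for the case where the empty cells are "" rather than NA. It assumes dplyr >= 1.0.0 (for across()) and tidyr >= 1.0.0 (for .direction = "downup", used here in place of the two separate fill() calls). The names df_blank and result are made up for the example, and distinct() is added only to cover the de-duplication step mentioned in the question.

```r
library(dplyr)
library(tidyr)

# Same five rows as the question, but with "" instead of NA (assumption)
df_blank <- data.frame(
  GR   = c(1L, 1L, 1L, 1L, 1L),
  SR   = c(2L, 2L, 2L, 3L, 3L),
  var1 = c("", "x", "", "x", ""),
  var2 = c("", "x", "", "x", ""),
  var3 = c("", "x", "", "x", ""),
  var4 = c("", "", "", "x", ""),
  var5 = c("", "", "", "", ""),
  var6 = c("x", "", "", "", ""),
  stringsAsFactors = FALSE
)

result <- df_blank %>%
  # recode "" to NA so that fill() has something to fill
  mutate(across(starts_with("var"), ~ na_if(.x, ""))) %>%
  group_by(GR, SR) %>%
  # fill downwards, then upwards, within each (GR, SR) group
  fill(starts_with("var"), .direction = "downup") %>%
  ungroup() %>%
  # drop the rows that are now identical
  distinct()

print(result)
```

Note that fill() copies whatever non-NA value it finds, so this only matches the "set to x" rule because x is the only non-empty value in these columns.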