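I have a large dataset in which one variable has a relatively large number of missing values. Because I know the variable depends on temporal and spatial attributes, I can easily impute each missing value by taking the value from another row whose temporal and spatial values match exactly. Suppose the data are generated as follows: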
temporal <- c("Monday", "Monday", "Tuesday", "Tuesday","Wednesday", "Wednesday", "Thursday", "Thursday", "Friday", "Friday","Monday", "Monday", "Tuesday", "Tuesday","Wednesday", "Wednesday", "Thursday", "Thursday", "Friday", "Friday")
spatial <- c("North", "South","North", "South","North", "South","North", "South","North", "South", "North", "South","North", "South","North", "South","North", "South","North", "South")
value <- c(NA,2,3,4,5,6,7,NA,9,10,1,NA,3,4,5,6,7,8,9,NA)
df <- as.data.frame(cbind(temporal, spatial, value))
This gives the following data frame:
temporal spatial value
1 Monday North NA
2 Monday South 2
3 Tuesday North 3
4 Tuesday South 4
5 Wednesday North 5
6 Wednesday South 6
7 Thursday North 7
8 Thursday South NA
9 Friday North 9
10 Friday South 10
11 Monday North 1
12 Monday South NA
13 Tuesday North 3
14 Tuesday South 4
15 Wednesday North 5
16 Wednesday South 6
17 Thursday North 7
18 Thursday South 8
19 Friday North 9
20 Friday South NA
In this case, I want to replace each value == NA with the value from another row that has matching values on temporal and spatial.
The final result should therefore look like this:
temporal spatial value
1 Monday North 1
2 Monday South 2
3 Tuesday North 3
4 Tuesday South 4
5 Wednesday North 5
6 Wednesday South 6
7 Thursday North 7
8 Thursday South 8
9 Friday North 9
10 Friday South 10
11 Monday North 1
12 Monday South 2
13 Tuesday North 3
14 Tuesday South 4
15 Wednesday North 5
16 Wednesday South 6
17 Thursday North 7
18 Thursday South 8
19 Friday North 9
20 Friday South 10
I tried to do this using the group_by function from the tidyverse:
library(tidyverse)
df <- df %>%
group_by(temporal, spatial) %>%
mutate(value, unique(value[is.na(value)]))
But I get the following error message:
Error: Problem with `mutate()` input `..2`.
x Input `..2` can't be recycled to size 2.
i Input `..2` is `unique(value[is.na(value)])`.
i Input `..2` must be size 2 or 1, not 0.
i The error occurred in group 1: temporal = "Friday", spatial = "North"
Am I approaching this the right way? If so, why doesn't my code work the way I believe it should? If not, what approach would be appropriate?
Thanks! :)
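Answer 0 (score: 1)
Here is a dplyr approach. We group by temporal and spatial, then arrange by temporal, spatial, and value, because NA values are automatically placed after any non-NA values. Then we use mutate to set value to the number in the first row of each group. Note that arrange changes the row order.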
library(dplyr)
df %>%
  group_by(temporal, spatial) %>%
  arrange(temporal, spatial, value) %>%  # NAs sort last within each combination
  mutate(value = value[1])               # first row of each group is non-NA if one exists
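If the original row order matters, the same idea can be sketched without arrange by picking the first non-NA value inside each group. This is only an illustration under assumptions: the data frame name df_num and the result name filled are made up here, and the data are rebuilt with data.frame() so that value stays numeric.

```r
library(dplyr)

# Same toy data as in the question; df_num is a hypothetical name.
# data.frame() keeps value numeric, unlike as.data.frame(cbind(...)).
days     <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
temporal <- rep(rep(days, each = 2), times = 2)
spatial  <- rep(c("North", "South"), times = 10)
value    <- c(NA, 2, 3, 4, 5, 6, 7, NA, 9, 10, 1, NA, 3, 4, 5, 6, 7, 8, 9, NA)
df_num   <- data.frame(temporal, spatial, value)

filled <- df_num %>%
  group_by(temporal, spatial) %>%
  # First observed value in the group; stays NA if the group has none
  mutate(value = value[!is.na(value)][1]) %>%
  ungroup()
```

Because mutate on a grouped data frame does not reorder rows, filled keeps the 20 rows in their original positions.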
A more concise approach using tidyr::fill, which preserves the row order:
library(tidyverse)
df %>%
  group_by(temporal, spatial) %>%
  fill(value, .direction = "downup")
# A tibble: 20 x 3
# Groups: temporal, spatial [10]
temporal spatial value
<chr> <chr> <chr>
1 Monday North 1
2 Monday South 2
3 Tuesday North 3
4 Tuesday South 4
5 Wednesday North 5
6 Wednesday South 6
7 Thursday North 7
8 Thursday South 8
9 Friday North 9
10 Friday South 10
11 Monday North 1
12 Monday South 2
13 Tuesday North 3
14 Tuesday South 4
15 Wednesday North 5
16 Wednesday South 6
17 Thursday North 7
18 Thursday South 8
19 Friday North 9
20 Friday South 10
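The <chr> type shown for value in the output above comes from the question's as.data.frame(cbind(...)) call: cbind builds a character matrix, so the numbers become strings. If numeric values are wanted, one possible sketch is to convert the column before filling. The names df_chr and filled_num are assumptions made for this illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the question's data the same way, so value starts out non-numeric
days     <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
temporal <- rep(rep(days, each = 2), times = 2)
spatial  <- rep(c("North", "South"), times = 10)
value    <- c(NA, 2, 3, 4, 5, 6, 7, NA, 9, 10, 1, NA, 3, 4, 5, 6, 7, 8, 9, NA)
df_chr   <- as.data.frame(cbind(temporal, spatial, value))

filled_num <- df_chr %>%
  # as.character() first guards against factor columns in older R versions
  mutate(value = as.numeric(as.character(value))) %>%
  group_by(temporal, spatial) %>%
  fill(value, .direction = "downup") %>%
  ungroup()
```

Building the data with data.frame(temporal, spatial, value) in the first place would avoid the conversion step entirely.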
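Answer 1 (score: 1)
Your mutate won't work for two reasons. First, you aren't assigning the result to a variable. Second, value[is.na(value)] selects the missing values rather than the observed ones, so it is empty in any group without an NA, which is what the "size 0" error reports. Your mutate() should look like mutate(value = unique(value[!is.na(value)])), which works as long as every group has exactly one distinct non-NA value. That isn't the approach I would take, though. What I do below is create a lookup table of the distinct non-NA values and then join it onto the original dataset. valuedis should be the value you want.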
temporal <- c("Monday", "Monday", "Tuesday", "Tuesday","Wednesday", "Wednesday", "Thursday", "Thursday", "Friday", "Friday","Monday", "Monday", "Tuesday", "Tuesday","Wednesday", "Wednesday", "Thursday", "Thursday", "Friday", "Friday")
spatial <- c("North", "South","North", "South","North", "South","North", "South","North", "South", "North", "South","North", "South","North", "South","North", "South","North", "South")
value <- c(NA,2,3,4,5,6,7,NA,9,10,1,NA,3,4,5,6,7,8,9,NA)
df <- as.data.frame(cbind(temporal, spatial, value))
library(dplyr)
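# Lookup table: one row per (temporal, spatial) pair with its observed value
dfdis <- df %>%
  filter(!is.na(value)) %>%
  distinct(temporal, spatial, value) %>%
  rename(valuedis = value)
# Attach the looked-up value to every row of the original data
df2 <- left_join(df, dfdis, by = c("temporal", "spatial"))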
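The join leaves the original value column and the looked-up valuedis column side by side. As a possible follow-up, dplyr::coalesce can fold the two into one column. The sketch below is self-contained and rebuilds the data with data.frame(); the names df_num, lookup, and df_filled are assumptions for illustration.

```r
library(dplyr)

days   <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
df_num <- data.frame(
  temporal = rep(rep(days, each = 2), times = 2),
  spatial  = rep(c("North", "South"), times = 10),
  value    = c(NA, 2, 3, 4, 5, 6, 7, NA, 9, 10, 1, NA, 3, 4, 5, 6, 7, 8, 9, NA)
)

# Lookup table of observed values, one row per (temporal, spatial) pair
lookup <- df_num %>%
  filter(!is.na(value)) %>%
  distinct(temporal, spatial, value) %>%
  rename(valuedis = value)

df_filled <- df_num %>%
  left_join(lookup, by = c("temporal", "spatial")) %>%
  # Keep the original value where present, otherwise use the looked-up one
  mutate(value = coalesce(value, valuedis)) %>%
  select(-valuedis)
```

This relies on each temporal and spatial pair having a single distinct non-NA value. If a pair had several, the lookup table would contain several rows for it and the join would duplicate rows in the result.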