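What is the most efficient way to update/replace NAs in a main dataset with the (correct) values from a lookup table? This is such a common operation! Similar questions do not seem to have tidy solutions.
Constraints: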
1) Please assume far more missing values and a much larger lookup table than in the example given. Replacing values case by case is therefore impractical (no if_else, case_when, etc.).
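2) The lookup table does not contain all the values of the main dataframe, only the replacement values.
Tidyverse solutions are preferred.
Ideally, left_join would offer a replace option for missing values. Alas ...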
library(tidyverse)
### Main Dataframe ###
df1 <- tibble(
state_abbrev = state.abb[1:10],
state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
value = sample(500:1200, 10, replace=TRUE)
)
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 551
#> 2 AK Alaska 765
#> 3 AZ Arizona 508
#> 4 AR Arkansas 756
#> 5 CA California 741
#> 6 CO <NA> 1100
#> 7 CT <NA> 719
#> 8 DE <NA> 874
#> 9 FL Florida 749
#> 10 GA Georgia 580
### Lookup Dataframe ###
lookup_df <- tibble(
state_abbrev = state.abb[6:8],
state_name = state.name[6:8]
)
#> # A tibble: 3 x 2
#> state_abbrev state_name
#> <chr> <chr>
#> 1 CO Colorado
#> 2 CT Connecticut
#> 3 DE Delaware
Created on 2018-07-28 by the reprex package (v0.2.0).
Answer 0 (score: 5)
Picking up Alistaire's and Nettle's suggestions and turning them into a working solution:
df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
# A tibble: 10 x 3
   state_abbrev value state_name
   <chr>        <int> <chr>
 1 AL             671 Alabama
 2 AK             501 Alaska
 3 AZ            1030 Arizona
 4 AR             694 Arkansas
 5 CA             881 California
 6 CO             821 Colorado
 7 CT             742 Connecticut
 8 DE             665 Delaware
 9 FL             948 Florida
10 GA             790 Georgia
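The pipeline above relies on coalesce() picking, element by element, the first non-missing value among its arguments. A minimal sketch of that behaviour on toy vectors (the names from_main and from_lookup are illustrative only, standing in for state_name.x and state_name.y):

```r
library(dplyr)

# Toy vectors standing in for state_name.x (main table) and
# state_name.y (lookup table) after the left join
from_main   <- c("Alabama", NA, NA)
from_lookup <- c(NA, "Colorado", NA)

# Element-wise: keep from_main where present, otherwise fall back to from_lookup;
# positions missing in both stay NA
filled <- coalesce(from_main, from_lookup)
filled
```

Because the main table's value comes first, a lookup value is only used where the main value is missing.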
The OP expressed a preference for a "tidyverse" solution. However, update joins are already available in the data.table package:
library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
    state_abbrev  state_name value
 1:           AL     Alabama  1103
 2:           AK      Alaska  1036
 3:           AZ     Arizona   811
 4:           AR    Arkansas   604
 5:           CA  California   868
 6:           CO    Colorado  1129
 7:           CT Connecticut   819
 8:           DE    Delaware  1194
 9:           FL     Florida   888
10:           GA     Georgia   501
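Note that the update join above overwrites state_name for every key found in the lookup table, which is fine here because the lookup only holds replacement values. If the lookup might also contain keys whose value is already present in the main table, a variant that only fills missing values could look like the sketch below. It uses toy tables (main, lookup, id and label are made-up names) and assumes a data.table version that provides fcoalesce() (1.12.4 or later).

```r
library(data.table)

# Toy tables: "b" is missing in main, "c" is present in both
main   <- data.table(id = c("a", "b", "c"), label = c("A", NA, "C"))
lookup <- data.table(id = c("b", "c"), label = c("B", "other"))

# Update join restricted to missing values: inside j, `label` is the main
# table's column for the matched rows and `i.label` is the lookup's value
main[lookup, on = "id", label := fcoalesce(label, i.label)]
main
```

As with the original update join, main is modified by reference, so work on a copy() if the original must be kept.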
library(bench)
bm <- press(
na_share = c(0.1, 0.5, 0.9),
n_row = length(state.abb) * 2 * c(1, 100, 10000),
{
n_na <- na_share * length(state.abb)
set.seed(1)
na_idx <- sample(length(state.abb), n_na)
tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
lookup_df <-tmp[na_idx]
tmp[na_idx, state_name := NA]
df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
mark(
dplyr = {
df1 <- copy(df0)
df1 <- df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
df1
},
upd_join = {
df1 <- copy(df0)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
}
)
}
)
ggplot2::autoplot(bm)
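data.table's update join is always faster (note the logarithmic time scale).
Because the update join modifies the data object in place, a fresh copy is used for each benchmark run.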
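Answer 1 (score: 2)
There is currently no one-liner for a coalescing join across more than one column (for a single column this can be done with a lookup table approach inside ifelse(is.na(value), ..., value)), though there has been discussion of how to implement such behavior. For now, you can build it manually. If you have a lot of columns, you can coalesce programmatically, or even put it in a function.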
library(tidyverse)
df1 <- tibble(
state_abbrev = state.abb[1:10],
state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
value = sample(500:1200, 10, replace=TRUE)
)
lookup_df <- tibble(
state_abbrev = state.abb[6:8],
state_name = state.name[6:8]
)
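df1 %>%
  full_join(lookup_df, by = 'state_abbrev') %>%
  # for every column that received a ".x" suffix, coalesce it with its ".y" twin;
  # the regex is anchored so that only a literal ".x" suffix matches
  bind_cols(map_dfc(grep('\\.x$', names(.), value = TRUE), function(x){
    set_names(
      list(coalesce(.[[x]], .[[sub('\\.x$', '.y', x)]])),
      sub('\\.x$', '', x)
    )
  })) %>%
  # keep only the original column names, dropping the suffixed helper columns
  select(union(names(df1), names(lookup_df)))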
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 877
#> 2 AK Alaska 1048
#> 3 AZ Arizona 973
#> 4 AR Arkansas 860
#> 5 CA California 938
#> 6 CO Colorado 639
#> 7 CT Connecticut 547
#> 8 DE Delaware 672
#> 9 FL Florida 667
#> 10 GA Georgia 1142
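The "put it in a function" idea might be sketched roughly as follows. coalesce_join is a hypothetical helper name, not an existing dplyr function, and the sketch assumes that by is a plain character vector of key columns present in both tables.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): left-join y onto x and fill NAs
# in the non-key columns shared by both tables with the values from y.
coalesce_join <- function(x, y, by) {
  shared <- setdiff(intersect(names(x), names(y)), by)
  joined <- left_join(x, y, by = by, suffix = c(".x", ".y"))
  for (col in shared) {
    x_col <- paste0(col, ".x")
    y_col <- paste0(col, ".y")
    joined[[col]] <- coalesce(joined[[x_col]], joined[[y_col]])
    joined[[x_col]] <- NULL
    joined[[y_col]] <- NULL
  }
  joined[names(x)]  # restore the column order of x
}

# Usage on small made-up tables
x <- tibble(k = c("a", "b"), v = c(NA, "keep"), n = 1:2)
y <- tibble(k = c("a", "b"), v = c("new", "ignored"))
res <- coalesce_join(x, y, by = "k")
res
```

A left join is used here so that keys appearing only in the lookup table do not add rows; swap in full_join if those rows are wanted.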
Answer 2 (score: 1)
To preserve the column order:
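One way this could be done, offered as a sketch rather than as the answerer's own code, is to coalesce after the join and then select the columns back in the order of df1. The value column uses fixed numbers instead of sample() so that the result is reproducible.

```r
library(dplyr)

df1 <- tibble(
  state_abbrev = state.abb[1:10],
  state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
  value = 1:10  # fixed values instead of sample(), for reproducibility
)
lookup_df <- tibble(
  state_abbrev = state.abb[6:8],
  state_name = state.name[6:8]
)

result <- df1 %>%
  left_join(lookup_df, by = "state_abbrev") %>%
  mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
  # selecting by the original names drops the .x/.y helpers and
  # restores the order state_abbrev, state_name, value
  select(all_of(names(df1)))
result
```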
Answer 3 (score: 0)
Here is a one-line solution with rows_update() (available in dplyr 1.0.0 and later):
df1 %>%
rows_update(lookup_df, by = "state_abbrev")
Demo:
library(dplyr)
### Main Dataframe ###
df1 <- tibble(
state_abbrev = state.abb[1:10],
state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
value = sample(500:1200, 10, replace=TRUE)
)
### Lookup Dataframe ###
lookup_df <- tibble(
state_abbrev = state.abb[6:8],
state_name = state.name[6:8]
)
df1 %>%
rows_update(lookup_df, by = "state_abbrev")
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 532
#> 2 AK Alaska 640
#> 3 AZ Arizona 521
#> 4 AR Arkansas 523
#> 5 CA California 970
#> 6 CO Colorado 695
#> 7 CT Connecticut 504
#> 8 DE Delaware 1088
#> 9 FL Florida 979
#> 10 GA Georgia 1059
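Note that rows_update() overwrites every matched row with the lookup values, whether or not the existing value is missing. dplyr also provides rows_patch(), which only fills in NA values. A small sketch of the difference on toy tables (main, lookup, id and label are made-up names):

```r
library(dplyr)

# Toy tables: "b" is missing in main, "c" is present in both
main   <- tibble(id = c("a", "b", "c"), label = c("A", NA, "C"))
lookup <- tibble(id = c("b", "c"), label = c("B", "changed"))

# rows_update() replaces the value for every matching key
updated <- rows_update(main, lookup, by = "id")

# rows_patch() replaces only values that are NA in main
patched <- rows_patch(main, lookup, by = "id")

updated$label
patched$label
```

With the question's data both give the same result, because the lookup table contains only keys whose state_name is missing.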
Answer 4 (score: -1)
If the abbreviation column is complete and the lookup table is complete, could you just drop the state_name column and then join?
left_join(df1 %>% select(-state_name), lookup_df, by = 'state_abbrev') %>%
select(state_abbrev, state_name, value)
Another option is to use match and if_else inside a mutate call, with the built-in lists of state names and abbreviations:
df1 %>%
mutate(state_name = if_else(is.na(state_name), state.name[match(state_abbrev,state.abb)], state_name))
Both give the same output:
# A tibble: 10 x 3
state_abbrev state_name value
<chr> <chr> <int>
1 AL Alabama 525
2 AK Alaska 719
3 AZ Arizona 1186
4 AR Arkansas 1051
5 CA California 888
6 CO Colorado 615
7 CT Connecticut 578
8 DE Delaware 894
9 FL Florida 536
10 GA Georgia 599
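The match idea can also be pointed at the lookup table itself instead of the built-in state.name and state.abb vectors, which keeps it usable when the lookup is incomplete. A minimal sketch under that assumption, again with fixed numbers in the value column for reproducibility:

```r
library(dplyr)

df1 <- tibble(
  state_abbrev = state.abb[1:10],
  state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
  value = 1:10  # fixed values instead of sample(), for reproducibility
)
lookup_df <- tibble(
  state_abbrev = state.abb[6:8],
  state_name = state.name[6:8]
)

result <- df1 %>%
  mutate(state_name = coalesce(
    state_name,
    # match() gives the lookup row for each key, or NA when the key is absent,
    # so unmatched rows simply keep their existing value
    lookup_df$state_name[match(state_abbrev, lookup_df$state_abbrev)]
  ))
result
```

No join is involved, so the row and column order of df1 is left untouched.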