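Is there a way to get the equivalent of a _merge indicator variable after a merge in dplyr?
Something similar to Pandas' indicator = True option, which essentially tells you how the merge went (how many matches from each dataset, and so on).
Here is an example in Pandas: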
import pandas as pd
df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})
match = df1.merge(df2, how = 'left', indicator = True)
Here, after a left join between df1 and df2, you want to know immediately how many rows in df1 found a match in df2 and how many did not:
match
Out[53]:
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only
I can then tabulate this _merge variable:
match._merge.value_counts()
Out[52]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
In dplyr, the equivalent left join gives:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)
> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA
Am I missing something here? Thanks!
Answer 0 (score: 6)
Stata similarly creates a new variable, _merge, when performing any type of merge or join. I also find it useful to have this as an option, so that a merge can be diagnosed quickly after it is performed.
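For the past few months I have been using basic functions I wrote that simply embellish the dplyr joins. There may be more efficient ways to do this, but here is an example of one that embellishes full_join. If you set the option .merge = T, you get a variable named .merge, similar to _merge in Stata or Pandas. (It also prints a diagnostic message reporting how many rows matched and how many did not, each time you use it.) I know you already have an answer to your question, but if you want a reusable function that works the same way as full_join in dplyr, here is a start. You obviously need to load dplyr for this to work...
For example: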
full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                            .merge = FALSE, ...){
  # Checking to make sure used variable names are not already in use
  if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
  }
  if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
  }
  if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
  }
  # Adding simple merge tracker variables to data frames
  x[, ".x_tracker"] <- 1
  y[, ".y_tracker"] <- 1
  # Doing full join
  joined <- full_join(x, y, by = by, suffix = suffix, ...)
  # Calculating merge diagnoses
  matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
  unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
  unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
  # Print merge diagnoses
  message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
  )
  # Create .merge variable if specified
  if(.merge){
    joined <- joined %>%
      mutate(.merge =
               case_when(
                 !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                 is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                 TRUE ~ "matched"
               )
      )
  }
  # Dropping tracker variables and returning data frame
  joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
  return(joined)
}
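To see the tracker-column idea on its own, here is a minimal sketch applied to the df1 and df2 from the question. The names in_x, in_y and merge_status are made up for illustration and are not part of dplyr. Each side gets a flag column before the join, and an NA flag afterwards means that side contributed no row.

```r
library(dplyr)

df1 <- data.frame(key1 = c("a", "b", "c"), v1 = c(1, 2, 3))
df2 <- data.frame(key2 = c("a", "b", "d"), v2 = c(4, 5, 6))

# Flag each side before joining (in_x and in_y are hypothetical helper columns).
joined <- full_join(
  mutate(df1, in_x = TRUE),
  mutate(df2, in_y = TRUE),
  by = c("key1" = "key2")
)

# An NA flag after the join means that side had no matching row.
result <- joined %>%
  mutate(
    merge_status = case_when(
      !is.na(in_x) & !is.na(in_y) ~ "both",
      !is.na(in_x)                ~ "left_only",
      TRUE                        ~ "right_only"
    )
  ) %>%
  select(-in_x, -in_y)

# Tabulate the indicator, similar in spirit to value_counts() in Pandas.
count(result, merge_status)
```

This sketch assumes the input data frames do not already contain columns named in_x, in_y or merge_status; the function above guards against that kind of clash explicitly.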
Answer 1 (score: 2)
We create the "merge" column from an inner_join and an anti_join, then combine the results with bind_rows:
d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
mutate(merge = "both")
bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>%
mutate(merge = 'left_only'))
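If only the counts are needed rather than a per-row column, one possible shortcut is to count the rows returned by dplyr's filtering joins. This is a sketch under the assumption that you care about rows of df1 only: semi_join keeps the rows of df1 that have a match in df2, and anti_join keeps those that do not. The names by_cols and match_counts are illustrative.

```r
library(dplyr)

df1 <- data.frame(key1 = c("a", "b", "c"), v1 = c(1, 2, 3))
df2 <- data.frame(key2 = c("a", "b", "d"), v2 = c(4, 5, 6))

by_cols <- c("key1" = "key2")

# Filtering joins never duplicate rows of df1, so these are counts of df1 rows.
match_counts <- c(
  both      = nrow(semi_join(df1, df2, by = by_cols)),
  left_only = nrow(anti_join(df1, df2, by = by_cols))
)
match_counts
```

Note that when df2 has duplicate keys, a left join can return more rows than df1 has, so these counts describe df1 rows and may differ from a tabulation of the joined result.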