I have two tables:
Inputs
Input1: Old Data Dictionary olddatadictionary.csv
  table      field       type description
1 MerzNisani hisse       LONG description 1
2 MerzNisani point_gisid LONG description 2
3 Polygon    gisid       LONG description 3
4 Polygon    layer_type  LONG description 4
Input2: New Data Dictionary newdatadictionary.csv
  table      field type
1 MerzNisani angle FLOAT
2 MerzNisani hisse LONG
3 Polygon    gisid LONG
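I want to join the old and new tables, keeping all rows and all columns. Where there is no matching value, the missing cells should be NA. This can be done with dplyr's full_join() function.
The problem: I want to add a column indicating which table each observation came from, as shown below.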
Output
Output: Joined Dictionary
  table      field       type  description   which_source
  (chr)      (chr)       (chr) (chr)         (chr)
1 MerzNisani angle       FLOAT NA            new
2 MerzNisani hisse       LONG  description 1 both
3 MerzNisani point_gisid LONG  description 2 old
4 Polygon    gisid       LONG  description 3 both
5 Polygon    layer_type  LONG  description 4 old
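I can add the which_source column, but only with verbose code built on if-else statements. Is there another solution in a functional programming style, so that the code is as clean and simple as possible and avoids if-else and for loops?
Thanks in advance.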
Answer 0: (score: 0)
Adding a column before merging seems to be the way to go:
Merge two R data frames and identify the source of each row
For your example,
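# Tag every row with the table it came from
old$source <- "old"
new$source <- "new"
# Outer join on the key columns; the tags arrive as source.x and source.y
merged <- merge(old, new, all = TRUE, by = c("table", "field", "type"))
# Two non-NA tags mean the row is in both tables; otherwise keep the one tag present
merged$source <- apply(merged[, c("source.x", "source.y")], 1, function(x) ifelse(length(na.omit(x)) == 2, "both", na.omit(x)))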
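The same tag-then-join idea can be sketched with dplyr, which is what the question asks about. This is only a minimal sketch, not the answerer's code: `old` and `new` are rebuilt here from the question's tables, and the flag columns `in_old` and `in_new` are hypothetical helper names. `case_when()` stands in for the nested if-else.

```r
library(dplyr)

# Sample data rebuilt from the question's two data dictionaries
old <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon", "Polygon"),
  field = c("hisse", "point_gisid", "gisid", "layer_type"),
  type = "LONG",
  description = paste("description", 1:4),
  stringsAsFactors = FALSE
)
new <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon"),
  field = c("angle", "hisse", "gisid"),
  type = c("FLOAT", "LONG", "LONG"),
  stringsAsFactors = FALSE
)

joined <- full_join(
  mutate(old, in_old = TRUE),   # hypothetical flag: row exists in old
  mutate(new, in_new = TRUE),   # hypothetical flag: row exists in new
  by = c("table", "field", "type")
) %>%
  # A flag is NA exactly when the row was missing from that table
  mutate(which_source = case_when(
    !is.na(in_old) & !is.na(in_new) ~ "both",
    !is.na(in_old) ~ "old",
    TRUE ~ "new"
  )) %>%
  select(-in_old, -in_new)      # drop the temporary flags
```

Joining on explicit keys keeps `description` as an ordinary column, so rows that exist only in `new` get NA there.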
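Answer 1: (score: 0)
Building on @fanli's response, if you have to do this many times, another approach is to define a new function that creates the temporary variables and then uses them to build the source variable. An example might be: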
# Old data dictionary
table <- c("MerzNisani", "MerzNisani", "Polygon", "Polygon")
field <- c("hisse", "point_gisid", "gisid", "layer_type")
type <- c("LONG", "LONG", "LONG", "LONG")
description <- c("description 1", "description 2", "description 3", "description 4")
# Keep strings as character so the join does not have to reconcile factor levels
my.df1 <- data.frame(table, field, type, description, stringsAsFactors = FALSE)
# New data dictionary
table <- c("MerzNisani", "MerzNisani", "Polygon")
field <- c("angle", "hisse", "gisid")
type <- c("FLOAT", "LONG", "LONG")
my.df2 <- data.frame(table, field, type, stringsAsFactors = FALSE)
library(dplyr)  # provides full_join() and coalesce()
full_join_source <- function(df1, df2, both_val = "both") {
  # Tag each input with the name of the object the caller passed in
  df1$temp.merge1 <- deparse(substitute(df1))
  df2$temp.merge2 <- deparse(substitute(df2))
  # Join on the columns the two inputs share (the tag columns differ by name)
  df_m <- full_join(df1, df2)
  # A row is in both inputs when both tags survived the join
  in_both <- !is.na(df_m$temp.merge1) & !is.na(df_m$temp.merge2)
  # Otherwise take whichever tag is present
  df_m$source <- coalesce(df_m$temp.merge1, df_m$temp.merge2)
  df_m$source[in_both] <- both_val
  # Drop the temporary tag columns
  return(df_m[, !(names(df_m) %in% c("temp.merge1", "temp.merge2"))])
}
my.fulljoin.df <- full_join_source(my.df1,my.df2,both_val="In Both")
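If the goal is to avoid if-else altogether, one further option is to compute two membership flags and use them to index a small lookup vector. The following is a sketch in base R, assuming that table, field and type together identify a row. `row_key` is a hypothetical helper, and `old` and `new` are rebuilt from the question's tables.

```r
# Sample data rebuilt from the question's two data dictionaries
old <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon", "Polygon"),
  field = c("hisse", "point_gisid", "gisid", "layer_type"),
  type = "LONG",
  description = paste("description", 1:4),
  stringsAsFactors = FALSE
)
new <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon"),
  field = c("angle", "hisse", "gisid"),
  type = c("FLOAT", "LONG", "LONG"),
  stringsAsFactors = FALSE
)

keys <- c("table", "field", "type")

# Hypothetical helper: collapse the key columns into one string per row
# (assumes "|" never occurs inside the key values)
row_key <- function(df) do.call(paste, c(df[keys], sep = "|"))

merged <- merge(old, new, by = keys, all = TRUE)
in_old <- row_key(merged) %in% row_key(old)
in_new <- row_key(merged) %in% row_key(new)

# in_old + 2 * in_new is 1 (old only), 2 (new only) or 3 (both)
merged$which_source <- c("old", "new", "both")[in_old + 2 * in_new]
```

Because the labels live in a plain vector, renaming them or swapping "both" for "In Both" is a one-word change.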