Merge tables and indicate the source table of each observation

Date: 2016-03-25 14:18:56

Tags: r dplyr

I have two tables:

Inputs
Input1: Old Data Dictionary olddatadictionary.csv

       table       field type   description
1 MerzNisani       hisse LONG description 1
2 MerzNisani point_gisid LONG description 2
3    Polygon       gisid LONG description 3
4    Polygon  layer_type LONG description 4

Input2: New Data Dictionary newdatadictionary.csv

       table field  type
1 MerzNisani angle FLOAT
2 MerzNisani hisse  LONG
3    Polygon gisid  LONG

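I want to join the old and new dictionaries, keeping all rows and all columns. Where there is no matching value, the missing entries should be NA. This can be done with the dplyr full_join() function.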
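For reference, the plain full join described above might look like the following minimal sketch. The data frame names `old` and `new` are assumptions (they are built inline here rather than read from the CSV files), and the join keys are assumed to be the three columns the two dictionaries share.

```r
library(dplyr)

# Old data dictionary (values copied from Input1)
old <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon", "Polygon"),
  field = c("hisse", "point_gisid", "gisid", "layer_type"),
  type = c("LONG", "LONG", "LONG", "LONG"),
  description = c("description 1", "description 2",
                  "description 3", "description 4"),
  stringsAsFactors = FALSE
)

# New data dictionary (values copied from Input2)
new <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon"),
  field = c("angle", "hisse", "gisid"),
  type = c("FLOAT", "LONG", "LONG"),
  stringsAsFactors = FALSE
)

# Keep every row from both sides; unmatched cells are filled with NA
joined <- full_join(old, new, by = c("table", "field", "type"))
joined <- arrange(joined, table, field)
```

The only row without a description is the one that exists solely in the new dictionary (field `angle`). Nothing in this result yet records where each row came from.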

The problem: I want to add a column indicating which table each observation came from, as shown below.

Output
Output: Joined Dictionary

       table       field  type   description which_source
       (chr)       (chr) (chr)         (chr)        (chr)
1 MerzNisani       angle FLOAT            NA          new
2 MerzNisani       hisse  LONG description 1         both
3 MerzNisani point_gisid  LONG description 2          old
4    Polygon       gisid  LONG description 3         both
5    Polygon  layer_type  LONG description 4          old

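I can add the which_source column, but only with some verbose code that uses if-else statements. Is there another solution in the functional programming paradigm, so that the code is as clean and simple as possible and avoids if-else and for loops?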

Thanks in advance.

2 Answers:

Answer 0: (score: 0)

Adding the column before merging seems to be the way to go:

Merge two R data frames and identify the source of each row

For your example,

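# Tag each data frame with its origin before merging
old$source <- "old"
new$source <- "new"
# "source" is not a join key, so merge() keeps both copies as source.x and source.y
merged <- merge(old, new, all = TRUE, by = c("table", "field", "type"))
# Two non-NA tags mean the row is in both tables; otherwise keep the single tag
merged$source <- apply(merged[, c("source.x", "source.y")], 1,
                       function(x) if (length(na.omit(x)) == 2) "both" else na.omit(x)[1])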
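Since the question asks for dplyr, the same idea can also be sketched without apply(). This is only one possible variant, not the method from the linked question: the marker columns `in_old` and `in_new` are hypothetical names introduced here, and case_when() is used as a vectorised replacement for row-wise if-else.

```r
library(dplyr)

# Example data copied from the question; "old" and "new" are assumed names
old <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon", "Polygon"),
  field = c("hisse", "point_gisid", "gisid", "layer_type"),
  type = c("LONG", "LONG", "LONG", "LONG"),
  description = c("description 1", "description 2",
                  "description 3", "description 4"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  table = c("MerzNisani", "MerzNisani", "Polygon"),
  field = c("angle", "hisse", "gisid"),
  type = c("FLOAT", "LONG", "LONG"),
  stringsAsFactors = FALSE
)

merged <- full_join(
  mutate(old, in_old = TRUE),   # hypothetical marker column
  mutate(new, in_new = TRUE),   # hypothetical marker column
  by = c("table", "field", "type")
) %>%
  # A marker is NA exactly when the row was absent from that table
  mutate(which_source = case_when(
    !is.na(in_old) & !is.na(in_new) ~ "both",
    !is.na(in_old) ~ "old",
    TRUE ~ "new"
  )) %>%
  select(-in_old, -in_new) %>%
  arrange(table, field)
```

After sorting by table and field, the which_source column should match the desired output in the question: new, both, old, both, old.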

Answer 1: (score: 0)

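Building on @fanli's answer: if you have to do this many times, another approach is to define a function that adds temporary marker variables to each data frame and then uses them to build the source variable. An example might be: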

library(dplyr)

# Old data dictionary; keep strings as character so the join keys match cleanly
table <- c("MerzNisani","MerzNisani","Polygon","Polygon")
field <- c("hisse","point_gisid","gisid","layer_type")
type <- c("LONG","LONG","LONG","LONG")
description <- c("description 1","description 2","description 3","description 4")
my.df1 <- data.frame(table, field, type, description, stringsAsFactors = FALSE)

# New data dictionary
table <- c("MerzNisani","MerzNisani","Polygon")
field <- c("angle","hisse","gisid")
type <- c("FLOAT","LONG","LONG")
my.df2 <- data.frame(table, field, type, stringsAsFactors = FALSE)


full_join_source <- function(df1, df2, both_val = "both") {

    # Capture the names of the data frames as passed by the caller,
    # before df1 and df2 are modified
    name1 <- deparse(substitute(df1))
    name2 <- deparse(substitute(df2))

    # Create temporary marker variables
    df1$temp.merge1 <- name1
    df2$temp.merge2 <- name2

    # With no "by" argument, full_join() joins on all shared columns
    df_m <- full_join(df1, df2)

    # Get data source/sources
    df_m$source <- apply(df_m[c("temp.merge1", "temp.merge2")], 1, function(x) paste(na.omit(x), collapse = ""))
    # Override source value when the row is present in both datasets
    df_m$source[!is.na(df_m$temp.merge1) & !is.na(df_m$temp.merge2)] <- both_val
    return(df_m[, !(names(df_m) %in% c("temp.merge1", "temp.merge2"))])
}

# source is "my.df1", "my.df2", or "In Both" for each row
my.fulljoin.df <- full_join_source(my.df1, my.df2, both_val = "In Both")