Join 2 datasets and create new rows for matches

Time: 2016-08-03 18:50:36

Tags: r dplyr

I would like to use R to join two datasets.

Dataset 1

    ID Name Date Price
    1    A   2011 $100
    2    B   2012 $200
    3    C   2013 $300

Dataset 2

    ID Date Price
    1  2012 $100
    1  2013 $200
    3  2014 $300

Using left_join() from dplyr on ID, I end up with this:

    ID Name Date.x Price.x Date.y Price.y
    1   A   2011    $100   2012   $100
    1   A   2011    $100   2013   $200
    2   B   2012    $200
    3   C   2013    $300   2014   $300

As a final product, I would like this instead:

    ID Name Date Price
    1   A   2011 $100
    1   A   2012 $100
    1   A   2013 $200
    2   B   2012 $200
    3   C   2013 $300
    3   C   2014 $300

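That is, instead of merging into an existing row, I want to create a new row whenever a match is found, copying the information that does not change (ID and Name) and changing the Date and Price columns as needed. Any ideas for an efficient way to do this on a large dataset?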

7 answers:

Answer 0: (score: 6)

You asked for an efficient way, so I'll introduce data.table:

library(data.table)
setDT(DF1)
setDT(DF2)

# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]

# stack data
DT = rbind(DF1, DF2)

# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]

which gives

   ID Date Price Name
1:  1 2011  $100    A
2:  2 2012  $200    B
3:  3 2013  $300    C
4:  1 2012  $100    A
5:  1 2013  $200    A
6:  3 2014  $300    C
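
rbind is very fast for data.tables. That said, I would not really expect efficiency to be a big concern when just binding two tables.

As for splitting the ID attribute Name off into its own table, this matches the recommendation of the dplyr package's author, who calls it making data tidy.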

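Answer 1: (score: 4)

This is a slight variation on @Frank's answer. The main issue is that your second table has no Name column. You can add it very efficiently with data.table's update-while-joining approach: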

require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back

Now that there is a Name column, we can simply rbind the results.

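# drop dt2 rows whose ID had no match in dt1 (Name is NA); na.omit() for data.tables takes `cols`
ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, cols="Name") else dt2)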

If necessary, use setorder() to reorder the result by reference:

setorder(ans, ID, Name) # by reference, no need to assign the result back
#    ID Name Date Price
# 1:  1    A 2011  $100
# 2:  1    A 2012  $100
# 3:  1    A 2013  $200
# 4:  2    B 2012  $200
# 5:  3    C 2013  $300
# 6:  3    C 2014  $300

The := operator and the set* functions in data.table modify the input object by reference.

The input tables used above:

dt1 = fread('ID Name   Date Price
              1    A   2011  $100
              2    B   2012  $200
              3    C   2013  $300')

dt2 = fread('ID  Date Price
              1  2012  $100
              1  2013  $200
              3  2014  $300')
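
To illustrate the by-reference behaviour mentioned above, here is a minimal sketch. The small tables and the name dt2_copy are assumptions made for illustration, not part of the answer; copy() is data.table's function for taking an independent copy.

```r
library(data.table)

# Small illustrative tables (assumed data, same shape as in the question)
dt1 <- data.table(ID = 1:3, Name = c("A", "B", "C"))
dt2 <- data.table(ID = c(1L, 1L, 3L), Date = 2012:2014)

# Work on a copy so that the original dt2 is left untouched
dt2_copy <- copy(dt2)
dt2_copy[dt1, Name := i.Name, on = "ID"]  # update join on the copy

names(dt2)       # the original still has no Name column
names(dt2_copy)  # the copy gained Name

# The same update on dt2 itself changes it in place, with no `<-` needed
dt2[dt1, Name := i.Name, on = "ID"]
names(dt2)
```

If the original table must stay unchanged, taking a copy() first is probably the simplest safeguard.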

Answer 2: (score: 1)

df1 <- data.frame(
  ID=1:3,
  Name=c("A","B","C"),
  Date=c(2011,2012,2013),
  Price=c(100,200,300)
)

df2 <- data.frame(
  ID=c(1,1,3),
  Date=c(2012,2013,2014),
  Price=c(100,200,300)
)

left_join cannot produce the desired output. You can use full_join:

library(dplyr)
merged <- full_join(df1, df2, by=c("Date","ID"))

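Here is a way to get the desired output using melt from the reshape2 package:

library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"), na.rm=TRUE) # drop rows with NA prices
merged$Name <- df1$Name[match(merged$ID, df1$ID)] # fill in Name for the rows that came from df2
merged <- merged[order(merged$ID, merged$Date), -4] # sort; drop the variable column added by melt
rownames(merged) <- NULL

Then:

> merged
  ID Name Date value
1  1    A 2011   100
2  1    A 2012   100
3  1    A 2013   200
4  2    B 2012   200
5  3    C 2013   300
6  3    C 2014   300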
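
For comparison, a dplyr-only sketch that avoids reshaping: look up Name for the second table, then stack the rows. This is one possible approach under the assumption that each ID appears only once in df1; the names df2_named and result are introduced here for illustration.

```r
library(dplyr)

df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                  Date = c(2011, 2012, 2013), Price = c(100, 200, 300))
df2 <- data.frame(ID = c(1L, 1L, 3L),
                  Date = c(2012, 2013, 2014), Price = c(100, 200, 300))

# Look up Name for each df2 row; inner_join drops IDs that are absent from df1
df2_named <- inner_join(df2, select(df1, ID, Name), by = "ID")

# Stack the rows (bind_rows matches columns by name), then sort
result <- bind_rows(df1, df2_named) %>% arrange(ID, Date)
result
```

Using left_join instead of inner_join in the lookup step would keep df2 rows whose ID is unknown, with Name set to NA.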

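Answer 3: (score: 1)

An inner join with nomatch = 0. For example, if every ID in dataset2 were 4, the inner join would not emit NA rows for the unmatched IDs. If you remove nomatch = 0, NAs are produced.

Edit: added the rbindlist wrapper following @Arun's suggestion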

library("data.table")
rbindlist(list(df1, 
               setDT(df1)[i = df2, 
                          j = .(ID, Name, Date = i.Date, Price = i.Price),
                          on = .(ID), 
                          nomatch = 0]))

Output

   ID Name Date Price
1:  1    A 2011  $100
2:  2    B 2012  $200
3:  3    C 2013  $300
4:  1    A 2012  $100
5:  1    A 2013  $200
6:  3    C 2014  $300
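
The effect of nomatch described in this answer can be seen on a tiny example. This is a sketch with made-up tables (ids and new_rows are hypothetical names), where ID 4 deliberately has no match.

```r
library(data.table)

ids      <- data.table(ID = 1:3, Name = c("A", "B", "C"))
new_rows <- data.table(ID = c(1L, 4L), Date = c(2012L, 2015L))  # ID 4 is absent from ids

# Default (nomatch = NA): every row of new_rows is kept; unmatched rows get NA
with_na <- ids[new_rows, on = "ID"]
with_na

# nomatch = 0: rows of new_rows without a match in ids are dropped (inner join)
inner <- ids[new_rows, on = "ID", nomatch = 0]
inner
```

With nomatch = 0 only the row for ID 1 remains, which is why the join above can be passed straight to rbindlist without a separate NA cleanup.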

Answer 4: (score: 1)

Perhaps one efficient way is to use a two-step merge.

# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)

# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)

Assign the missing "Name" values to dataset 2 using the merge function from the {base} package

dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

Merge the datasets

merge(dataset1, dataset2, all = TRUE)

gives:

   ID Name Date Price
1  1    A 2011  $100
2  1    A 2012  $100
3  1    A 2013  $200
4  2    B 2012  $200
5  3    C 2013  $300
6  3    C 2014  $300
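
Why the second merge stacks rows rather than widening the table: by default merge() joins on every column name the two data frames share. The sketch below makes that explicit; the variable names shared and result are introduced here for illustration.

```r
dataset1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                       Date = 2011:2013, Price = c("$100", "$200", "$300"))
dataset2 <- data.frame(ID = c(1, 1, 3),
                       Date = 2012:2014, Price = c("$100", "$200", "$300"))

# Step 1: only ID is shared, so merge() joins on ID and adds Name
dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

# Step 2: all four columns are now shared, so all of them act as join keys
shared <- intersect(names(dataset1), names(dataset2))
shared

# Passing `by` explicitly is equivalent here and makes the intent visible;
# no row matches on all four keys, so all = TRUE keeps every row from both sides
result <- merge(dataset1, dataset2, by = shared, all = TRUE)
result
```

If a row of dataset2 happened to be identical to a row of dataset1 in all four columns, it would be matched and appear once rather than twice.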

Answer 5: (score: 0)

You can use plyr's join to bring the names into the second DF, then rbind to combine the rows.

library(plyr)

## Add the name column to df2 and get rid of unwanted columns
df3 <- join(df2,df1,by = "ID")
df3[,6] <- NULL
df3[,5] <- NULL

combined <- rbind(df1,df3)

Answer 6: (score: 0)

 > dsa
   ID Name Date Price
 1  1    A 2011  $100
 2  2    B 2012  $200
 3  3    C 2013  $300

 >dsb
  ID Date Price
 1  1 2012  $100
 2  1 2013  $200
 3  3 2014  $300

 >dsb$Name <- NA

 >dsr <- rbind(dsa,dsb)
 >dsr$Name <- dsa$Name[match(dsr$ID,dsa$ID)]
 >dsr

   ID Name Date Price
 1  1    A 2011  $100
 2  2    B 2012  $200
 3  3    C 2013  $300
 4  1    A 2012  $100
 5  1    A 2013  $200
 6  3    C 2014  $300

I'm new to R, so I can't yet use its full potential for best efficiency. But this does the job.
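
The same match-based idea can be wrapped in a small base R function. This is only a sketch: stack_with_lookup is a hypothetical helper name, and it assumes the key column is unique in the first table.

```r
# Hypothetical helper: stack `extra` under `base`, filling `attr` by looking up `key`
stack_with_lookup <- function(base, extra, key = "ID", attr = "Name") {
  # match() returns, for each key in extra, the position of its first match in base
  extra[[attr]] <- base[[attr]][match(extra[[key]], base[[key]])]
  out <- rbind(base, extra[names(base)])  # align column order with base
  out[order(out[[key]]), ]                # ties keep their original order
}

dsa <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                  Date = 2011:2013, Price = c("$100", "$200", "$300"))
dsb <- data.frame(ID = c(1L, 1L, 3L),
                  Date = 2012:2014, Price = c("$100", "$200", "$300"))

dsr <- stack_with_lookup(dsa, dsb)
dsr
```

Rows of the second table whose ID is missing from the first get NA in Name; they can be removed afterwards if that is not wanted.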