I want to join two datasets using R.
Dataset 1
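ID Name Date Price
1  A    2011 $100
2  B    2012 $200
3  C    2013 $300
Dataset 2
ID Date Price
1  2012 $100
1  2013 $200
3  2014 $300
Using left_join() from dplyr on ID, I end up with this:
ID Name Date.x Price.x Date.y Price.y
1  A    2011   $100    2012   $100
1  A    2011   $100    2013   $200
2  B    2012   $200
3  C    2013   $300    2014   $300
That is not what I want. Rather than merging into the existing row, I want to create a new row whenever a match is found, copying the information that does not change (ID and Name) and changing the Date and Price columns as needed. Any ideas for an efficient way to do this on a large dataset?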
Answer 0 (score: 6)
You asked for an efficient method, so I will introduce data.table:
library(data.table)
setDT(DF1)
setDT(DF2)
# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]
# stack data
DT = rbind(DF1, DF2)
# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]
which gives
ID Date Price Name
1: 1 2011 $100 A
2: 2 2012 $200 B
3: 3 2013 $300 C
4: 1 2012 $100 A
5: 1 2013 $200 A
6: 3 2014 $300 C
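rbind is very fast for data.tables. That said, when I am only binding two tables, I would not expect efficiency to be a big concern.
As for spinning the ID attribute Name off into its own table, this matches the recommendation of the dplyr package's author, who calls it making data tidy.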
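Answer 1 (score: 4)
This is a slight variation on @Frank's answer. The main issue is that your second table has no Name column. With data.table's update-while-joining approach, that column can be obtained very efficiently: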
require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back
Now that there is a Name column, we can simply rbind the results.
ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, by="Name") else dt2)
If necessary, reorder the result by reference using setorder():
setorder(ans, ID, Name) # by reference, no need to assign the result back
# ID Name Date Price
# 1: 1 A 2011 $100
# 2: 1 A 2012 $100
# 3: 1 A 2013 $200
# 4: 2 B 2012 $200
# 5: 3 C 2013 $300
# 6: 3 C 2014 $300
The := operator and the set* functions in data.table modify the input object by reference.
Data used above:
dt1 = fread('ID Name Date Price
1 A 2011 $100
2 B 2012 $200
3 C 2013 $300')
dt2 = fread('ID Date Price
1 2012 $100
1 2013 $200
3 2014 $300')
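The "by reference" point above can be illustrated with a minimal sketch. The table and the names `dt`, `alias`, and `snapshot` are made up for this illustration and are not part of the answer; the sketch assumes only that the data.table package is installed.

```r
library(data.table)

dt <- data.table(ID = 1:3, Date = 2011:2013)
alias <- dt            # plain assignment: no copy, both names refer to one table
snapshot <- copy(dt)   # explicit deep copy, independent of dt

dt[, Name := c("A", "B", "C")]  # adds the column in place
setorder(dt, -ID)               # reorders the rows in place

names(alias)     # includes "Name": alias is the same object as dt
alias$ID         # rows are in the new order as well
names(snapshot)  # unchanged: only "ID" and "Date"
```

This is why neither `dt2[dt1, Name := i.Name, on = "ID"]` nor `setorder(ans, ID, Name)` needs its result assigned back. If the original table must be kept intact, take a `copy()` first.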
Answer 2 (score: 1)
df1 <- data.frame(
ID=1:3,
Name=c("A","B","C"),
Date=c(2011,2012,2013),
Price=c(100,200,300)
)
df2 <- data.frame(
ID=c(1,1,3),
Date=c(2012,2013,2014),
Price=c(100,200,300)
)
left_join cannot produce the desired output. You can use full_join instead.
library(dplyr)
merged <- full_join(df1, df2, by=c("Date","ID"))
Here is how to get the desired output using melt from the reshape2 package:
library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))
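Then:
> merged <- merged[!is.na(merged$value), -4] # drop NA values and the 'variable' column added by melt
> merged$Name <- df1$Name[match(merged$ID, df1$ID)] # fill in Name for the rows that came from df2
> merged
   ID Name Date value
1   1    A 2011   100
2   2    B 2012   200
3   3    C 2013   300
10  1    A 2012   100
11  1    A 2013   200
12  3    C 2014   300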
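For comparison, here is a hedged sketch of a dplyr-only route to the six-row result the question asks for: look up Name for each row of the second table, then stack the two tables. It is an alternative, not the method of the answer above. The names `df2_named` and `result` are introduced for this illustration, and ID is written as a plain numeric column in both tables so the types match.

```r
library(dplyr)

df1 <- data.frame(
  ID = c(1, 2, 3),
  Name = c("A", "B", "C"),
  Date = c(2011, 2012, 2013),
  Price = c(100, 200, 300)
)
df2 <- data.frame(
  ID = c(1, 1, 3),
  Date = c(2012, 2013, 2014),
  Price = c(100, 200, 300)
)

# Step 1: attach Name to every row of df2 via a lookup on ID
df2_named <- left_join(df2, select(df1, ID, Name), by = "ID")

# Step 2: stack the two tables (columns are matched by name) and sort
result <- bind_rows(df1, df2_named) %>%
  arrange(ID, Date)

result
```

Only the ID-to-Name lookup goes through a join, so no `.x`/`.y` columns are created and nothing has to be reshaped afterwards.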
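Answer 3 (score: 1)
An inner join with nomatch = 0. For example, if every ID in dataset 2 were 4, the inner join would not emit NA rows for the unmatched IDs. If you remove nomatch = 0, NAs are generated.
Edit: added the rbindlist wrapper, as suggested by @Arun.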
library("data.table")
rbindlist(list(df1,
setDT(df1)[i = df2,
j = .(ID, Name, Date = i.Date, Price = i.Price),
on = .(ID),
nomatch = 0]))
Output
ID Name Date Price
1: 1 A 2011 $100
2: 2 B 2012 $200
3: 3 C 2013 $300
4: 1 A 2012 $100
5: 1 A 2013 $200
6: 3 C 2014 $300
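The effect of nomatch described in this answer can be seen in a small sketch. The tables `owners` and `updates` are hypothetical and chosen so that one ID (4) has no match; they are not the data from the question.

```r
library(data.table)

owners  <- data.table(ID = 1:3, Name = c("A", "B", "C"))
updates <- data.table(ID = c(1L, 4L), Date = c(2012L, 2015L))

# Default behaviour: every row of `updates` is kept, and the unmatched
# ID 4 gets Name = NA
with_na <- owners[updates, on = .(ID)]

# nomatch = 0: rows of `updates` with no match in `owners` are dropped,
# which makes this an inner join
inner <- owners[updates, on = .(ID), nomatch = 0]

with_na
inner
```

Recent data.table releases also accept `nomatch = NULL` as the spelling for the inner-join case.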
Answer 4 (score: 1)
Perhaps one efficient approach is a two-step merge.
# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)
# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)
Fill in the missing "Name" values in dataset 2 using the merge function from the {base} package:
dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)
Merge the datasets:
merge(dataset1, dataset2, all = TRUE)
which gives:
ID Name Date Price
1 1 A 2011 $100
2 1 A 2012 $100
3 1 A 2013 $200
4 2 B 2012 $200
5 3 C 2013 $300
6 3 C 2014 $300
Answer 5 (score: 0)
You can use join from plyr to bring the names into the second data frame, then rbind to combine the rows.
library(plyr)
## Add the name column to df2 and get rid of unwanted columns
df3 <- join(df2,df1,by = "ID")
df3[,6] <- NULL
df3[,5] <- NULL
combined <- rbind(df1,df3)
Answer 6 (score: 0)
> dsa
ID Name Date Price
1 1 A 2011 $100
2 2 B 2012 $200
3 3 C 2013 $300
> dsb
ID Date Price
1 1 2012 $100
2 1 2013 $200
3 3 2014 $300
> dsb$Name <- NA
> dsr <- rbind(dsa, dsb)
> dsr$Name <- dsa$Name[match(dsr$ID, dsa$ID)]
> dsr
ID Name Date Price
1 1 A 2011 $100
2 2 B 2012 $200
3 3 C 2013 $300
4 1 A 2012 $100
5 1 A 2013 $200
6 3 C 2014 $300
I am new to R and cannot yet exploit its full potential for efficiency, but this gets the job done.