Question

I have the following Spark DataFrame:

agent_product_sale=data.frame(agent=c('a','b','c','d','a','a','a','b','c','a','b'),product=c('P1','P2','P3','P4','P1','P1','P2','P2','P2','P3','P3'),
      sale_amount=c(1000,2000,3000,4000,1000,1000,2000,2000,2000,3000,3000))

RDD_aps=createDataFrame(sqlContext,agent_product_sale)


   agent product sale_amount
1      a      P1        1000
2      b      P2        2000
3      c      P3        3000
4      d      P4        4000
5      a      P1        1000
6      a      P1        1000
7      a      P2        2000
8      b      P2        2000
9      c      P2        2000
10     a      P3        3000
11     b      P3        3000

and

percent=data.frame(agent=c('a','b','c'),percent=c(0.2,0.5,1.0))

agent  percent
  a      0.2
  b      0.5
  c      1.0

I need to join (merge) the two DataFrames so that I have the percent for each agent, with output like this:

   agent product sale_amount     percent
1      d      P4        4000          NA
2      c      P3        3000         1.0
3      c      P2        2000         1.0
4      b      P2        2000         0.5
5      b      P2        2000         0.5
6      b      P3        3000         0.5
7      a      P1        1000         0.2
8      a      P1        1000         0.2
9      a      P1        1000         0.2
10     a      P2        2000         0.2
11     a      P3        3000         0.2

I have already tried:

     joined_aps=join(RDD_aps,percent,RDD_aps$agent==percent$agent,"left_outer")

But it adds a second "agent" column from the percent DataFrame, and I don't want the duplicate column.

I have also tried:

merged=merge(RDD_aps,percent, by = "agent",all.x=TRUE)

This also adds an "agent_y" column, but I want only one agent column (the agent column from RDD_aps).

Answer 1

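I think I saw someone on SO prevent the '_x' and '_y' variables from being generated by using join, but I can't find the post. Personally I prefer merge in my own work. I find it easier, and I like being able to switch between left/right/inner/outer/etc. joins with the all.x=TRUE/FALSE and all.y=TRUE/FALSE arguments. I still get the annoying (but useful for validation purposes) _x and _y columns, but I fix these using code similar to the following example: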
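A minimal sketch of that kind of cleanup, written against the question's data rather than taken from the answer. It assumes a SparkR 1.x-style setup (sparkR.init and sparkRSQL.init, to match the question's sqlContext) and a SparkR version whose merge suffixes the key column as agent_x and agent_y, as described above. The names percent_df and cleaned are hypothetical.

```r
library(SparkR)

# Assumption: SparkR 1.x-style initialisation, matching the question's sqlContext.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

agent_product_sale <- data.frame(
  agent = c('a','b','c','d','a','a','a','b','c','a','b'),
  product = c('P1','P2','P3','P4','P1','P1','P2','P2','P2','P3','P3'),
  sale_amount = c(1000,2000,3000,4000,1000,1000,2000,2000,2000,3000,3000),
  stringsAsFactors = FALSE)
percent <- data.frame(agent = c('a','b','c'), percent = c(0.2, 0.5, 1.0),
                      stringsAsFactors = FALSE)

RDD_aps <- createDataFrame(sqlContext, agent_product_sale)
percent_df <- createDataFrame(sqlContext, percent)  # hypothetical name

# Left outer merge; the key column is expected to come back as agent_x and agent_y.
merged <- merge(RDD_aps, percent_df, by = "agent", all.x = TRUE)

# Keep only the left-hand key plus the data columns, then restore the original key name.
cleaned <- select(merged, "agent_x", "product", "sale_amount", "percent")
cleaned <- withColumnRenamed(cleaned, "agent_x", "agent")

head(cleaned, 20)
```

Selecting by name leaves the agent_y column behind without relying on any particular way of dropping columns, and agent d keeps its row with a missing percent because of all.x=TRUE.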

How to join (merge) two SparkDataFrames in SparkR and keep only one of the common columns
