PySpark duplicates after joining two dataframes

Date: 2019-12-02 12:45:03

Tags: python dataframe join pyspark

I have two PySpark DataFrames.

df1

TransID  Date     custusername
1        11/01      1A
2        11/01      1A
3        11/02      1A
4        11/02      1A
5        11/03      1A

df2

custusername   Date    CustID
1A             11/01    xx1
1A             11/02    xx1
1A             11/03    xx2

Desired output after joining the two dataframes and counting:

Date   CustID   Count
11/01   xx1      2
11/02   xx1      2
11/03   xx2      1

The actual output I am getting is:

Date   CustID   Count
11/01   xx1      2
11/01   xx2      2
11/02   xx1      2
11/02   xx2      2
11/03   xx1      1
11/03   xx2      1

Because the CustID was updated on 11/03, my counts are getting duplicated: when my code joins the two dataframes and counts, every date ends up paired with both CustIDs.
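
For illustration, joining on custusername alone (ignoring Date) reproduces exactly this duplication. The sketch below is illustrative rather than the original code; it just rebuilds the two tables shown above and performs such a join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same data as the df1/df2 tables above
df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransID', 'Date', 'custusername'])
df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustID'])

# Joining on custusername only pairs every transaction with every CustID
# the user has ever had, so each Date ends up with both xx1 and xx2.
dup = (df1
       .join(df2.select('custusername', 'CustID').distinct(),
             on='custusername', how='left')
       .groupBy('Date', 'CustID')
       .count())
dup.show()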

1 answer:

Answer 0 (score: 0)

Given the two DataFrames:

df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
|      1|11/01|          1A|
|      2|11/01|          1A|
|      3|11/02|          1A|
|      4|11/02|          1A|
|      5|11/03|          1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
|          1A|11/01|   xx1|
|          1A|11/02|   xx1|
|          1A|11/03|   xx2|
+------------+-----+------+

I would group the first DataFrame by Date and custusername.

df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01|          1A|    2|
|11/03|          1A|    1|
|11/02|          1A|    2|
+-----+------------+-----+

Then simply join with df2:

df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
|          1A|11/01|    2|   xx1|
|          1A|11/03|    1|   xx2|
|          1A|11/02|    2|   xx1|
+------------+-----+-----+------+
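
The duplication is gone here because Date is part of the join keys, so each date only picks up the CustID that was active on that date. If you also want the columns named and ordered exactly as in the desired output, a small optional follow-up (a sketch using the df from the previous step):

result = (df
          .withColumnRenamed('count', 'Count')
          .orderBy('Date')
          .select('Date', 'CustId', 'Count'))
result.show()
+-----+------+-----+
| Date|CustId|Count|
+-----+------+-----+
|11/01|   xx1|    2|
|11/02|   xx1|    2|
|11/03|   xx2|    1|
+-----+------+-----+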