I have two PySpark DataFrames.
df1
TransID Date custusername
1 11/01 1A
2 11/01 1A
3 11/02 1A
4 11/02 1A
5 11/03 1A
df2
custusername Date CustID
1A 11/01 xx1
1A 11/02 xx1
1A 11/03 xx2
Desired output after joining the two DataFrames and counting:
Date CustID Count
11/01 xx1 2
11/02 xx1 2
11/03 xx2 1
The actual output I am getting is:
Date CustID Count
11/01 xx1 2
11/01 xx2 2
11/02 xx1 2
11/02 xx2 2
11/03 xx1 1
11/03 xx2 1
Because the CustID was updated to a new value on 11/03, my counts are duplicating.
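One hypothetical way to reproduce exactly this duplicated output (an illustrative sketch, not the original code) is a join keyed on custusername alone, so every transaction matches both of the customer's CustIDs. Using the df1/df2 definitions from the answer below:
# Hypothetical reproduction of the duplication (illustrative only):
# joining on custusername alone pairs every transaction with every
# CustId the customer has ever had, so each date is counted once per CustId.
wrong = (df1
    .join(df2.select('custusername', 'CustId').distinct(), on='custusername')
    .groupBy('Date', 'CustId')
    .count())
wrong.show()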
Answer 0 (score: 0)
Given the two DataFrames:
from pyspark.sql import SparkSession

# Create or reuse a SparkSession so that `spark` is defined
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
(1, "11/01", "1A"),
(2, "11/01", "1A"),
(3, "11/02", "1A"),
(4, "11/02", "1A"),
(5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
| 1|11/01| 1A|
| 2|11/01| 1A|
| 3|11/02| 1A|
| 4|11/02| 1A|
| 5|11/03| 1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
("1A", "11/01", "xx1"),
("1A", "11/02", "xx1"),
("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
| 1A|11/01| xx1|
| 1A|11/02| xx1|
| 1A|11/03| xx2|
+------------+-----+------+
I will group the first DataFrame by Date and custusername.
df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01| 1A| 2|
|11/03| 1A| 1|
|11/02| 1A| 2|
+-----+------------+-----+
Then simply join with df2:
df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
| 1A|11/01| 2| xx1|
| 1A|11/03| 1| xx2|
| 1A|11/02| 2| xx1|
+------------+-----+-----+------+
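This already carries the correct counts. To match the desired layout, the count column can be renamed to Count and the columns reordered (a small follow-up sketch; the orderBy is only for readable display, since groupBy output order is not deterministic):
from pyspark.sql import functions as F

# Rename `count` and put the columns in the desired order;
# sort by Date purely for display.
result = df.select('Date', 'CustId', F.col('count').alias('Count')).orderBy('Date')
result.show()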