How can I aggregate and analyze a column in a Spark DataFrame when that column was created from a column containing multiple dictionaries?
For example: summing all the key values while grouping by a different column.
Answer 0 (score: 2)
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
# explode turns the array of structs into one row per struct element
df = df.withColumn('col', f.explode(df['col']))
# group on the struct field v1 and sum the struct field k
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
| 1| 3|
| 3| 2|
| 2| 3|
| 4| 2|
+---------+-----------------+
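Once the array is exploded, other aggregations follow the same pattern. As a variation (a sketch assuming the same file, an active SparkSession named spark, and the v1/v2 field names from the JSON above), you can group by the top-level idx column and sum several struct fields in one pass:

from pyspark.sql import functions as f

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
# one aggregation pass, grouping on the original row key instead of a struct field
df.groupBy('idx').agg(
    f.sum('col.v1').alias('sum_v1'),
    f.sum('col.v2').alias('sum_v2'),
).show()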