How can I aggregate and analyze a column in a Spark DataFrame when that column was created from a column containing multiple dictionaries?
For example: summing all the key values while grouping by a different column.
Answer 0 (score: 2)
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
# explode turns the array of structs into one row per struct element
df = df.withColumn('col', f.explode(df['col']))
# group on the struct field v1 and sum the struct field k
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
| 1| 3|
| 3| 2|
| 2| 3|
| 4| 2|
+---------+-----------------+
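Once the array is exploded, other aggregations follow the same pattern. As a variation (a sketch assuming the same file, an active SparkSession named spark, and the v1/v2 field names from the JSON above), you can group by the top-level idx column and sum several struct fields in one pass:

from pyspark.sql import functions as f

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
# one aggregation pass, grouping on the original row key instead of a struct field
df.groupBy('idx').agg(
    f.sum('col.v1').alias('sum_v1'),
    f.sum('col.v2').alias('sum_v2'),
).show()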