I am trying to use pivot in Apache Spark.
My data is:
+--------------------+---------+
|           timestamp|     user|
+--------------------+---------+
|2017-12-19T00:41:...|   User_1|
|2017-12-19T00:01:...|   User_2|
|2017-12-19T00:01:...|   User_1|
|2017-12-19T00:01:...|   User_1|
|2017-12-19T00:01:...|   User_2|
+--------------------+---------+
I want to pivot on the user column.
But I keep getting this error:
'DataFrame' object has no attribute 'pivot'
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1020, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'pivot'
no matter how I use it,
i.e. df.groupBy('A').pivot('B') or df.pivot('B').
My actual query is:
# The Pivot operation will give timestamp vs Users data
pivot_pf = tf.groupBy(window(tf["timestamp"], "2 minutes"), 'user').count().select('window.start', 'user', 'count').pivot("user").sum("count")
Any help is much appreciated.
Thanks.
Answer 0 (score: 1)
In PySpark, pivot is defined on GroupedData, not on DataFrame, so it must be called on the result of groupBy before any aggregation:
import pyspark.sql.functions as func
from datetime import datetime

# sample data: a timestamp column and a user column
df = spark_session.createDataFrame(
    [[datetime.strptime("2012-01-01 00:00:00", '%Y-%m-%d %H:%M:%S'), 'one'],
     [datetime.strptime("2012-01-01 00:01:00", '%Y-%m-%d %H:%M:%S'), 'two'],
     [datetime.strptime("2012-01-01 00:02:00", '%Y-%m-%d %H:%M:%S'), 'three'],
     [datetime.strptime("2012-01-01 00:03:00", '%Y-%m-%d %H:%M:%S'), 'one'],
     [datetime.strptime("2012-01-01 00:04:00", '%Y-%m-%d %H:%M:%S'), 'two']],
    'dd: timestamp, user: string')

# group into 2-minute windows, pivot on user, then aggregate (count per cell)
df.groupBy(func.window(df["dd"], "2 minutes")).pivot('user').agg({'dd': 'count'}).show()
Expected output:
+--------------------+----+-----+----+
|              window| one|three| two|
+--------------------+----+-----+----+
|[2012-01-01 00:00...|   1| null|   1|
|[2012-01-01 00:04...|null| null|   1|
|[2012-01-01 00:02...|   1|    1|null|
+--------------------+----+-----+----+
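Applying the same pattern to the query from the question gives a sketch along these lines (untested; tf, timestamp and user are the names from the question):
import pyspark.sql.functions as func

# sketch: pivot must come between groupBy and the aggregation
pivot_pf = (tf
            .groupBy(func.window(tf["timestamp"], "2 minutes"))
            .pivot("user")
            .count())
# optionally pull the window start out of the struct, as in the original select
pivot_pf = pivot_pf.withColumn("start", func.col("window.start"))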
Answer 1 (score: 0)
Pivot itself works fine, as shown below, but it returns grouped data (a RelationalGroupedDataset). Applying an aggregation to that grouped data turns it back into a DataFrame.
val d1 = Array(("a", "10"), ("b", "20"), ("c", "30"), ("a", "56"), ("c", "29"))
val rdd1 = sc.parallelize(d1)
val df1 = rdd1.toDF("key", "val")
// pivot alone returns a RelationalGroupedDataset, not a DataFrame
df1.groupBy("key").pivot("val")
// adding an aggregation such as count yields a DataFrame
df1.groupBy("key").pivot("val").count().show()
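The same distinction holds in PySpark; a minimal sketch (data and column names are illustrative, not from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("a", 10), ("b", 20), ("c", 30), ("a", 56), ("c", 29)], ["key", "val"])

grouped = df1.groupBy("key").pivot("val")  # GroupedData: cannot call show() yet
result = grouped.count()                   # aggregation turns it into a DataFrame
result.show()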