I am using PySpark and have the following dataframe:
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to turn each sid (4178, for example) into a column, with the rounded ratio as its row value. The result should look like this:
+---------+------+------+------+
|       id| 4178 | 4385 | 4390 |
+---------+------+------+------+
|  6052791| 0.32 |    0 |    0 |
+---------+------+------+------+
(if the id has that sid, the row holds the rounded ratio; if not, it is filled with 0)
The number of columns is the number of sids with the same rounded ratio. If a given sid does not exist for an id, the ratio column must contain 0.
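A minimal sketch of the whole transform in the Scala API (assuming, as in the first answer below, that the dataframe above is named ratios): round the ratio to two decimals, pivot sid into columns, and fill the cells of missing sids with 0:

import org.apache.spark.sql.functions._

ratios.groupBy("id")
  .pivot("sid")                       // one output column per distinct sid
  .agg(first(round(col("ratio"), 2))) // one rounded ratio per (id, sid)
  .na.fill(0)                         // 0 where the id has no such sid
  .show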
Answer 0 (score: 1)
This sounds like a pivot, which in Spark SQL (the Scala version) would look as follows:
scala> ratios.
  groupBy("id").
  pivot("sid").
  agg(first("ratio")).
  show
+-------+-------------------+
| id| 4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+
I'm still not sure how to select the other columns (4385 and 4390 in the example). It seems that you round the ratio and then search for the other matching sids.
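If the full set of sids is known up front, it can be passed to pivot explicitly; that guarantees the 4385 and 4390 columns appear for every id (as null, fillable with 0) and saves Spark a pass to compute the distinct values. A sketch assuming that list:

scala> ratios.
  groupBy("id").
  pivot("sid", Seq(4178, 4385, 4390)). // explicit column set
  agg(first("ratio")).
  na.fill(0).                          // 0 where an id lacks the sid
  show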
Answer 1 (score: 1)
You need a column to group by, so I'll add a new column called sNo.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(List(
    (6052791, 4178, 0.42673267326732675),
    (6052791, 4178, 0.22673267326732675),
    (6052791, 4179, 0.62673267326732675),
    (6052791, 4180, 0.72673267326732675),
    (6052791, 4179, 0.82673267326732675),
    (6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")

df.withColumn("sNo", lit(1)) // constant column so there is something to group by
  .groupBy("sNo")
  .pivot("sid")              // one output column per distinct sid
  .agg(min("ratio"))         // keep the smallest ratio for each sid
  .show
This returns the output:
+---+-------------------+------------------+------------------+
|sNo| 4178| 4179| 4180|
+---+-------------------+------------------+------------------+
| 1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
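To get the shape the question asks for (one row per id), the same idea works with id as the grouping column instead of the constant sNo, plus a zero fill for missing sids. A sketch under that assumption:

df.groupBy("id")
  .pivot("sid")
  .agg(min("ratio"))
  .na.fill(0) // ids without a given sid get 0 instead of null
  .show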