How to create columns from rows and fill in the subsequent column values in Python Spark

Date: 2017-05-19 05:37:58

Tags: python apache-spark pyspark pyspark-sql

I am working in pyspark with the following dataframe:

+---------+----+--------------------+-------------------+
|       id| sid|              values|              ratio|
+---------+----+--------------------+-------------------+
|  6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...|  0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...|           1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...|  0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+

What I want is to turn each sid (4178, for example) into its own column, with the rounded ratio as the row value. The result should look like this:

+---------+-------+------+-------+
|       id| 4178  | 4385 | 4390  | (if the id has this sid, fill the row with its ratio)
+---------+-------+------+-------+
|  6052791| 0.32  | 0    | 0     | (if not, fill with 0)

The number of columns is the number of distinct sids, each holding the rounded ratio as its value.

If an id has no row for a given sid, that sid's column must contain 0.
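
A minimal PySpark sketch of the transformation being asked for (assuming the dataframe above is named df; using first as the aggregate and rounding to two decimals are assumptions, since the question's 0.32 could also mean truncation):

from pyspark.sql import functions as F

# Pivot sid values into columns, using the rounded ratio as the cell value.
# Ids with no row for a given sid get null, which na.fill replaces with 0.
result = (df.groupBy("id")
            .pivot("sid")
            .agg(F.first(F.round("ratio", 2)))
            .na.fill(0))
result.show()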

2 answers:

Answer 0 (score: 1):

This sounds like a pivot, which could be done in Spark SQL (Scala version) as follows:

scala> ratios.
  groupBy("id").
  pivot("sid").
  agg(first("ratio")).
  show
+-------+-------------------+
|     id|               4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+

I am still not sure how you would select the other columns (4385 and 4390 in the example). It seems you round ratio and then search for other matching sids.
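
In PySpark terms (matching the question's tags), pivot enumerates every distinct sid by itself, and the wanted columns can also be pinned by passing the values explicitly — a sketch, assuming the question's dataframe is named df:

from pyspark.sql import functions as F

# Listing the wanted sid values fixes the column set and order,
# and skips the extra pass Spark needs to discover distinct values.
df.groupBy("id") \
  .pivot("sid", [4178, 4385, 4390]) \
  .agg(F.first("ratio")) \
  .na.fill(0) \
  .show()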

Answer 1 (score: 1):

You need a column to group by; for that, I am adding a new column named sNo.

  import sqlContext.implicits._
  import org.apache.spark.sql.functions._

  // Sample data: a single id with several sids and ratios.
  val df = sc.parallelize(List((6052791, 4178, 0.42673267326732675),
    (6052791, 4178, 0.22673267326732675),
    (6052791, 4179, 0.62673267326732675),
    (6052791, 4180, 0.72673267326732675),
    (6052791, 4179, 0.82673267326732675),
    (6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")

  // Group on the constant sNo, pivot sid into columns,
  // and keep the minimum ratio for each sid.
  df.withColumn("sNo", lit(1))
    .groupBy("sNo")
    .pivot("sid")
    .agg(min("ratio"))
    .show

This returns the following output:

+---+-------------------+------------------+------------------+
|sNo|               4178|              4179|              4180|
+---+-------------------+------------------+------------------+
|  1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
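
Note that grouping on the constant sNo collapses everything into a single row. To get one row per id, as the question asks, the groupBy can target id directly — in PySpark terms, assuming the same sample data in a dataframe named df:

from pyspark.sql import functions as F

# One output row per id; min resolves duplicate (id, sid) pairs,
# and na.fill(0) covers sids an id never had.
df.groupBy("id").pivot("sid").agg(F.min("ratio")).na.fill(0).show()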