Appending column names to column values using Spark

Time: 2019-08-12 10:44:18

Tags: pyspark apache-spark-sql azure-databricks fpgrowth

My data is comma-separated and I have loaded it into a Spark DataFrame. The data looks like this:

  A B C
  1 2 3
  4 5 6
  7 8 9

Using pyspark, I want to transform the above DataFrame into:

   A    B   C
  A_1  B_2  C_3
  A_4  B_5  C_6
  --------------

Then convert it into a list of lists using pyspark:

[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]

And then run the FP-Growth algorithm on the above dataset using pyspark.

The code I have tried so far is as follows:

from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark import SQLContext

sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")

names = df.schema.names

Then I thought of doing something inside a for loop:

 for name in names:
      -----
      ------
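
For illustration, one possible shape of such a loop (a sketch only, assuming every column is a string, which is the case when the CSV is read with header="true" and no schema) would be to prepend each column's name to its values:

from pyspark.sql.functions import concat, lit, col

# Sketch: replace each column with "<column name>_<value>", e.g. 1 becomes "A_1".
for name in names:
    df = df.withColumn(name, concat(lit(name), lit("_"), col(name)))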

After this, I would use FPGrowth:

df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
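
For reference, the fitted model exposes its results as DataFrames, so inspecting them would look something like this (standard FPGrowthModel attributes from pyspark.ml.fpm):

# Frequent itemsets and the association rules derived from them.
model.freqItemsets.show()
model.associationRules.show()

# transform() adds a prediction column with the consequent items for each input row.
model.transform(df).show()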

1 Answer:

Answer 0: (score: 1)

A lot of the concepts here come from material written for Scala users, and this shows how to do the same with pyspark. The approach is somewhat different, but there is certainly something to learn from it, although how much is a fair question. I certainly learned a bit about pyspark myself through zipWithIndex. Anyway.

The first part converts the data into the required format. The data could probably also be imported from a file, but I kept it inline as-is:

from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import * 
from pyspark.sql import Row
from pyspark.sql import functions as f

source_df = spark.createDataFrame(
    [
        (1, 11, 111),
        (2, 22, 222)
    ],
    ["colA", "colB", "colC"]
)

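# Prepend each column's name to its values in every row, e.g. 1 -> colA_1.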
intermediate_df = reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
)

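# Join all columns into one comma-separated string, then split it back into a single array column.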
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))

result_df = result_df.select(split(col("CONCAT_COLS"), r",\s*").alias("ARRAY_COLS"))

# Add 0, 1, 2, 3, ... with zipWithIndex; we append it at the back, but that does not matter, you can move it around.
# Build the new schema: the existing fields (one in this case, but handled flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])

# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
               lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
              )

final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)

This returns:

 +---------------------------+-----+
 |ARRAY_COLS                 |index|
 +---------------------------+-----+
 |[colA_1, colB_11, colC_111]|0    |
 |[colA_2, colB_22, colC_222]|1    |
 +---------------------------+-----+

The second part is the old zipWithIndex with pyspark, needed if you want the 0, 1, 2, ... index; it is painful compared to Scala.

In general this is easier to solve in Scala.

Not sure about performance, and it is not the fun foldLeft approach, but I think it is actually fine.
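
To tie this back to the original question: the ARRAY_COLS column above is already the array-of-items shape that FPGrowth expects, so (as a sketch, reusing the minSupport and minConfidence values from the question) it can be fed to the algorithm directly, with no need to build a Python list of lists first:

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="ARRAY_COLS", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(final_result_df)
model.freqItemsets.show(truncate=False)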