Fast way to unpivot a DataFrame in PySpark

Time: 2018-09-21 00:10:02

Tags: apache-spark pyspark pyspark-sql

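Is there a fast, efficient way to unpivot a DataFrame? I have tried the two methods below. Both work on sample data, but on the full data set they run for hours and never finish.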

Method 1:

from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type descriptions
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("question_id"), col(c).alias("response_value"))
        for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.question_id", "kvs.response_value"])

Method 2:

from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('user_id')
    # Emit one (user_id, question_id, response_value) row per remaining column
    for k in rowDict:
        yield Row(**{'user_id': valA, 'question_id': k, 'response_value': row[k]})

user_response_df = spark.createDataFrame(response_df.rdd.flatMap(rowExpander))

2 answers:

Answer 0: (score: 0)

Perhaps you could select each column into its own DataFrame and then union them all, like this:

   consumer_id  order_total  SID
0            1            5    1
1            2            6    2
2            3            7    3
3            1            5    1
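
The idea in this answer (one narrow DataFrame per column, then a union of all of them) could be sketched roughly as below. This is only an illustration, not the answerer's code: the helper name `unpivot_by_union` and its parameters are hypothetical, the output column names are borrowed from the question, and it assumes the column names contain no quotes or backticks and that all value columns share one type. Since each per-column select is a separate pass over the source, it may well be no faster than the methods in the question when there are many columns.

```python
from functools import reduce


def unpivot_by_union(df, by, value_cols,
                     key_name="question_id", value_name="response_value"):
    """Hypothetical helper: build one narrow DataFrame per value column,
    then union them into a single long DataFrame."""
    if not value_cols:
        raise ValueError("value_cols must not be empty")

    id_exprs = ["`{}`".format(b) for b in by]
    parts = [
        # Keep the id columns, add the column name as a string literal,
        # and expose the column's value under a common name.
        df.selectExpr(*(id_exprs + [
            "'{}' as {}".format(c, key_name),
            "`{}` as {}".format(c, value_name),
        ]))
        for c in value_cols
    ]
    # DataFrame.union matches columns by position, which is fine here
    # because every part has the same column order.
    return reduce(lambda left, right: left.union(right), parts)
```

With the question's data this would be called as `unpivot_by_union(response_df, ["user_id"], [c for c in response_df.columns if c != "user_id"])`.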

Answer 1: (score: 0)

df.selectExpr('col1', 'stack(2, "col2", col2, "col3", col3) as (cols, values)')
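
For a wide DataFrame, writing the `stack(...)` expression by hand gets tedious, so it can be generated from the column list. The following is a minimal sketch under some assumptions: `build_stack_expr` is a hypothetical helper name, column names are assumed to contain no quotes or backticks, and all stacked columns are assumed to share one type, as `stack` requires.

```python
def build_stack_expr(value_cols, key_name="question_id",
                     value_name="response_value"):
    """Hypothetical helper: build a Spark SQL stack() expression that
    turns each column in value_cols into a (name, value) row."""
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    # Each column contributes a pair: its name as a string literal,
    # followed by a backtick-quoted reference to the column itself.
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_cols)
    return "stack({}, {}) as ({}, {})".format(
        len(value_cols), pairs, key_name, value_name
    )


# Reproduces the shape of the expression used in the answer above.
expr = build_stack_expr(["col2", "col3"], "cols", "values")
```

The result can then be passed to `selectExpr` next to the id columns, for example `df.selectExpr("col1", expr)`, or for the question's data `response_df.selectExpr("user_id", build_stack_expr(value_cols))`, where `value_cols` is every column except `user_id`.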