Question

Update: added repartition and persist.

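I have a DataFrame (the data comes from JSON) that contains a column named table, and I need to use that column to split the frame into several new DataFrames. Some sample data: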

+--------------+-------------+
|         table|    timestamp|
+--------------+-------------+
|       A_TABLE|1573085110000|
|       A_TABLE|1573171204000|
|       A_TABLE|1572912308000|
|AN_OTHER_TABLE|1573171513000|
|AN_OTHER_TABLE|1572912020000|
|AN_OTHER_TABLE|1573084819000|
+--------------+-------------+

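I don't know how many distinct values the table column will contain, nor what those values will actually be. I considered grouping by that column, but I don't think that is feasible, because I need to pivot the data afterwards, and they are all different.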

So I did the following which, by my own admission, is a bit hacky:

# Repartition & persist
df = df.repartition("table")
df.persist()

tblNames = [row.table for row in df.select("table").distinct().collect()]

# Split the frame..
for tblName in tblNames:
  filtered_df = df.filter(df.table == tblName)

  # Do some work with 'filtered_df', a bit of exploding, pivoting, etc..
  # This results in a new dataframe, here named out_df

  # output
  out_path = "/mnt/path/auto/" + tblName + "/"
  out_df.write.format("delta").save(out_path)

Now, this is "OK" because I only have a few to deal with, but I can't help thinking that it won't scale at all.

Is there a smart way to restructure this so that the operations run in parallel? And am I right in thinking that it also runs on the head node?

I considered doing something like this, though I'm not sure anything is gained by using the rdd, other than reduced readability:

tblNames = df.select("table").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.filter(df.table == tblName) for tblName in tblNames]
dfs[0].show()
# Now i need to loop through these.. have i gained anything?

Closing thoughts: perhaps I could move all the "work" into a udf or something similar, and run it on each DataFrame in that list at the same time?
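To make the "run it on each frame at the same time" idea concrete, here is a minimal sketch of one way it might look. It assumes the per-table work can be wrapped in a single function, and that Spark actions submitted from separate driver threads can be scheduled concurrently. The names run_per_table and process_table are hypothetical, and the stand-in body only builds the output path so the sketch runs without a cluster; the real filter, pivot and write steps appear as comments.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_per_table(table_names, work, max_workers=4):
    """Call work(name) for each table name from a pool of driver threads.

    Returns (results, errors): two dicts keyed by table name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the table it belongs to.
        futures = {pool.submit(work, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other tables.
                errors[name] = exc
    return results, errors


# Hypothetical per-table job. With Spark the body would be roughly:
#   filtered_df = df.filter(df.table == tbl_name)
#   ... explode / pivot into out_df ...
#   out_df.write.format("delta").save("/mnt/path/auto/" + tbl_name + "/")
def process_table(tbl_name):
    # Stand-in: only compute the output path (assumption for illustration).
    return "/mnt/path/auto/" + tbl_name + "/"


results, errors = run_per_table(["A_TABLE", "AN_OTHER_TABLE"], process_table)
```

The threads only submit jobs; the heavy lifting would still happen on the executors, so max_workers is best thought of as "how many Spark jobs may be in flight at once" rather than a CPU count.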

Thanks!

PySpark: split a DataFrame in parallel on column values

0 answers: