Dataframe performance issue when retrieving rows in hierarchical order in pyspark

Date: 2018-12-19 18:15:54

Tags: python pyspark

I am trying to retrieve data from a csv file in hierarchical order using a pyspark dataframe, but it takes more than 3 hours to retrieve 30,000 records in hierarchical order.

Is there a different way to solve this with pyspark dataframes?

Can anyone help me with this?

    from datetime import datetime
    from pyspark.sql.functions import lit, col

    # csv columns: childid, parentid
    df = spark.read.csv("path/of/csv/file", **kwargs)
    df.cache()
    df.show()

    def get_child(pid, df, col_name):
        # rows whose parentid equals the given id
        df_child_s = df.selectExpr(col_name).where(col("parentid") == pid)
        return df_child_s


    def all_data(pid, df, col_name):
        cnt = 0
        df_o = get_child(pid, df, col_name)
        df_o = df_o.withColumn("order_id", lit(cnt))

        df_child_exist = len(df_o.take(1)) >= 1
        if df_child_exist:
            dst = df_o.selectExpr("childid").first()[0]

        while df_child_exist:
            cnt += 1

            df_o2 = get_child(dst, df, "*")
            df_o2 = df_o2.withColumn("order_id", lit(cnt))

            df_child_exist = len(df_o2.take(1)) >= 1
            if df_child_exist:
                dst = df_o2.selectExpr("childid").first()[0]
                df_o = df_o.union(df_o2)

        return df_o


    pid = 0
    start = datetime.now()
    df_f_1 = all_data(pid, df, "*")
    df_f_1.show()
    end = datetime.now()
    totalTime = end - start
    print(f"total execution time :{totalTime}")
**csv file data**

    childid  parentid
    248278   264543
    251713   252689
    252689   248278
    258977   251713
    264543   0

**expected output result:**

    childId  parentId
    264543   0
    248278   264543
    252689   248278
    251713   252689

OR

    +------+------+-----+
    |   dst|   src|level|
    +------+------+-----+
    |264543|     0|    0|
    |248278|264543|    1|
    |252689|248278|    2|
    |251713|252689|    3|
    |258977|251713|    4|
    +------+------+-----+

2 Answers:

Answer 0 (score: 1)

Raj, here is the graphFrame answer, as requested.

I believe this can be done with GraphFrames. I did not find an easy way to get all of the descendants, so I am offering two solutions.

from graphframes import GraphFrame
from pyspark.sql.functions import col

# initial dataframe
edgesDf = spark.createDataFrame([
    (248278, 264543),
    (251713, 252689),
    (252689, 248278),
    (258977, 251713),
    (264543, 0)
  ],
  ["dst", "src"]
)

# get all ids as vertices
verticesDf = edgesDf.select(col("dst").alias("id")).union(edgesDf.select("src")).distinct()

# create graphFrame
graphGf = GraphFrame(verticesDf, edgesDf)

# for performance
sc.setCheckpointDir("/tmp/checkpoints")
graphGf.cache()

####  Motif approach
# note that this requires knowing the depth of the tree
fullPathDf = graphGf.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d); (d)-[de]->(e); (e)-[ef]->(f)")
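
# (sketch, not part of the original answer) for a pure chain like this one,
# the path depth equals the number of edges, so the motif string could be
# built programmatically instead of hard-coded:
#   n = edgesDf.count()
#   motif = "; ".join(f"({chr(97 + i)})-[e{i}]->({chr(97 + i + 1)})" for i in range(n))
#   fullPathDf = graphGf.find(motif)
# (the pivot below would then reference e0 ... e4 instead of ab ... ef)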

# pivot
edgeDf = fullPathDf.select(col("ab").alias("edge")).union(fullPathDf.select("bc")).union(fullPathDf.select("cd")).union(fullPathDf.select("de")).union(fullPathDf.select("ef"))

# Result 
edgeDf.select("edge.dst", "edge.src").show()
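
# (sketch, my addition, not in the original answer) to reproduce the "level"
# column from the expected output, tag each edge column with its position in
# the path before unioning:
from pyspark.sql.functions import lit
levelDf = (fullPathDf.select("ab.dst", "ab.src", lit(0).alias("level"))
    .union(fullPathDf.select("bc.dst", "bc.src", lit(1)))
    .union(fullPathDf.select("cd.dst", "cd.src", lit(2)))
    .union(fullPathDf.select("de.dst", "de.src", lit(3)))
    .union(fullPathDf.select("ef.dst", "ef.src", lit(4))))
levelDf.orderBy("level").show()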

### Breadth First Search approach
# 
# Does not require knowing the depth, but does require knowing the id of the leaf node
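# (sketch, my addition) if the leaf id is not known in advance, it can be
# derived: a leaf is an id that appears as a child (dst) but never as a
# parent (src)
leafId = edgesDf.select("dst").subtract(edgesDf.select("src")).first()[0]  # 258977 here
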
pathDf = graphGf.bfs("id = 0", "id = 258977", maxPathLength = 5)

# pivot
edgeDf = pathDf.select(col("e0").alias("edge")).union(pathDf.select("e1")).union(pathDf.select("e2")).union(pathDf.select("e3")).union(pathDf.select("e4"))

# Result
edgeDf.select("edge.dst", "edge.src").show()

Answer 1 (score: 0)

I would suggest adding a dataframe checkpoint() to your code. This keeps the dataframe lineage from growing so long that it causes performance problems. Your code appears to have many dataframes, and it is not clear to me why you are creating several of them, so I am not sure which dataframe would benefit most from checkpointing. Add the checkpoint to the dataframe that you modify on every iteration. Here is a good pyspark explanation of checkpointing.
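
As a rough sketch of what that could look like in the question's loop (illustrative only, assuming the corrected column names from the question; DataFrame.checkpoint() is eager by default and requires a checkpoint directory):

    sc.setCheckpointDir("/tmp/checkpoints")

    while df_child_exist:
        cnt += 1
        df_o2 = get_child(dst, df, "*").withColumn("order_id", lit(cnt))
        df_child_exist = len(df_o2.take(1)) >= 1
        if df_child_exist:
            dst = df_o2.selectExpr("childid").first()[0]
            # checkpoint the accumulating dataframe so its query plan does
            # not grow with every union
            df_o = df_o.union(df_o2).checkpoint()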