Dataframe performance issue when retrieving rows in hierarchical order in pyspark

Date: 2018-12-19 18:15:54

Tags: python pyspark

I am trying to retrieve data from a csv file in hierarchical order using a pyspark dataframe, but it takes more than 3 hours to retrieve 30,000 records in hierarchical order.

Is there a different way to solve this with pyspark dataframes?

Can anyone help me with this?

    from datetime import datetime
    from pyspark.sql.functions import lit, col

    # csv columns: childid, parentid
    df = spark.read.csv("path/of/csv/file", **kwargs)
    df.cache()
    df.show()

    def get_child(pid, df, col_name):
        # rows whose parentid equals the given id
        df_child_s = df.selectExpr(col_name).where(col("parentid") == pid)
        return df_child_s


    def all_data(pid, df, col_name):
        cnt = 0
        df_o = get_child(pid, df, col_name)
        df_o = df_o.withColumn("order_id", lit(cnt))

        df_child_exist = len(df_o.take(1)) >= 1
        if df_child_exist:
            dst = df_o.selectExpr("childid").first()[0]

        while df_child_exist:
            cnt += 1

            df_o2 = get_child(dst, df, "*")
            df_o2 = df_o2.withColumn("order_id", lit(cnt))

            df_child_exist = len(df_o2.take(1)) >= 1
            if df_child_exist:
                dst = df_o2.selectExpr("childid").first()[0]
                df_o = df_o.union(df_o2)

        return df_o


    pid = 0
    start = datetime.now()
    df_f_1 = all_data(pid, df, "*")
    df_f_1.show()
    end = datetime.now()
    totalTime = end - start
    print(f"total execution time :{totalTime}")
**csv file data**

    childid  parentid
    248278   264543
    251713   252689
    252689   248278
    258977   251713
    264543   0

**expected output result:**

    childId  parentId
    264543   0
    248278   264543
    252689   248278
    251713   252689

OR

    +------+------+-----+
    |   dst|   src|level|
    +------+------+-----+
    |264543|     0|    0|
    |248278|264543|    1|
    |252689|248278|    2|
    |251713|252689|    3|
    |258977|251713|    4|
    +------+------+-----+

2 Answers:

Answer 0 (score: 1)

Raj, here is the graphFrame answer, as requested.

I believe this can be done with GraphFrames. I did not find an easy way to get all of the descendants, so I am offering two solutions.

from graphframes import GraphFrame
from pyspark.sql.functions import col

# initial dataframe
edgesDf = spark.createDataFrame([
    (248278, 264543),
    (251713, 252689),
    (252689, 248278),
    (258977, 251713),
    (264543, 0)
  ],
  ["dst", "src"]
)

# get all ids as vertices
verticesDf = edgesDf.select(col("dst").alias("id")).union(edgesDf.select("src")).distinct()

# create graphFrame
graphGf = GraphFrame(verticesDf, edgesDf)

# for performance
sc.setCheckpointDir("/tmp/checkpoints")
graphGf.cache()

####  Motif approach
# note that this requires knowing the depth of the tree
fullPathDf = graphGf.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d); (d)-[de]->(e); (e)-[ef]->(f)")
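
# (sketch, not part of the original answer) for a pure chain like this one,
# the path depth equals the number of edges, so the motif string could be
# built programmatically instead of hard-coded:
#   n = edgesDf.count()
#   motif = "; ".join(f"({chr(97 + i)})-[e{i}]->({chr(97 + i + 1)})" for i in range(n))
#   fullPathDf = graphGf.find(motif)
# (the pivot below would then reference e0 ... e4 instead of ab ... ef)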

# pivot
edgeDf = fullPathDf.select(col("ab").alias("edge")).union(fullPathDf.select("bc")).union(fullPathDf.select("cd")).union(fullPathDf.select("de")).union(fullPathDf.select("ef"))

# Result 
edgeDf.select("edge.dst", "edge.src").show()
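
# (sketch, my addition, not in the original answer) to reproduce the "level"
# column from the expected output, tag each edge column with its position in
# the path before unioning:
from pyspark.sql.functions import lit
levelDf = (fullPathDf.select("ab.dst", "ab.src", lit(0).alias("level"))
    .union(fullPathDf.select("bc.dst", "bc.src", lit(1)))
    .union(fullPathDf.select("cd.dst", "cd.src", lit(2)))
    .union(fullPathDf.select("de.dst", "de.src", lit(3)))
    .union(fullPathDf.select("ef.dst", "ef.src", lit(4))))
levelDf.orderBy("level").show()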

### Breadth First Search approach
# 
# Does not require knowing the depth, but does require knowing the id of the leaf node
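# (sketch, my addition) if the leaf id is not known in advance, it can be
# derived: a leaf is an id that appears as a child (dst) but never as a
# parent (src)
leafId = edgesDf.select("dst").subtract(edgesDf.select("src")).first()[0]  # 258977 here
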
pathDf = graphGf.bfs("id = 0", "id = 258977", maxPathLength = 5)

# pivot
edgeDf = pathDf.select(col("e0").alias("edge")).union(pathDf.select("e1")).union(pathDf.select("e2")).union(pathDf.select("e3")).union(pathDf.select("e4"))

# Result
edgeDf.select("edge.dst", "edge.src").show()

Answer 1 (score: 0)

I would suggest adding a dataframe checkpoint() to your code. This keeps the dataframe lineage from growing so long that it causes performance problems. Your code appears to have many dataframes, and it is not clear to me why you are creating several of them, so I am not sure which dataframe would benefit most from checkpointing. Add the checkpoint to the dataframe that you modify on every iteration. Here is a good pyspark explanation of checkpointing.
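
As a rough sketch of what that could look like in the question's loop (illustrative only, assuming the corrected column names from the question; DataFrame.checkpoint() is eager by default and requires a checkpoint directory):

    sc.setCheckpointDir("/tmp/checkpoints")

    while df_child_exist:
        cnt += 1
        df_o2 = get_child(dst, df, "*").withColumn("order_id", lit(cnt))
        df_child_exist = len(df_o2.take(1)) >= 1
        if df_child_exist:
            dst = df_o2.selectExpr("childid").first()[0]
            # checkpoint the accumulating dataframe so its query plan does
            # not grow with every union
            df_o = df_o.union(df_o2).checkpoint()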