Dataframe performance issue when retrieving rows in hierarchical order in PySpark
I am trying to retrieve data from a csv file in hierarchical order using a PySpark dataframe, but it takes more than 3 hours to retrieve 30,000 records in hierarchical order.
Is there an alternative way to solve this problem with PySpark dataframes?
Can someone help me with this?
**csv file data**

```
childid  parentid
248278   264543
251713   252689
252689   248278
258977   251713
264543   0
```

**expected output result:**

```
childId  parentId
264543   0
248278   264543
252689   248278
251713   252689
```

OR

```
+------+------+-----+
|   dst|   src|level|
+------+------+-----+
|264543|     0|    0|
|248278|264543|    1|
|252689|248278|    2|
|251713|252689|    3|
|258977|251713|    4|
+------+------+-----+
```

```python
from datetime import datetime
from pyspark.sql.functions import col, lit

# read the csv file (path and reader options were elided in the original post)
df = spark.read.csv("path/of/csv/file", **kwargs)
df.cache()
df.show()

def get_child(pid, df, col_name):
    # return the row(s) whose parentid equals the given id
    return df.selectExpr(col_name).where(col("parentid") == pid)

def all_data(pid, df, col_name):
    cnt = 0
    df_o = get_child(pid, df, col_name)
    df_o = df_o.withColumn("order_id", lit(cnt))
    df_child_exist = len(df_o.take(1)) >= 1
    if df_child_exist:
        dst = df_o.selectExpr("childid").first()[0]
    # walk down the hierarchy one child at a time
    while df_child_exist:
        cnt += 1
        df_o2 = get_child(dst, df, "*")
        df_o2 = df_o2.withColumn("order_id", lit(cnt))
        df_child_exist = len(df_o2.take(1)) >= 1
        if df_child_exist:
            dst = df_o2.selectExpr("childid").first()[0]
            df_o = df_o.union(df_o2)
    return df_o

pid = 0
start = datetime.now()
df_f_1 = all_data(pid, df, "*")
df_f_1.show()
end = datetime.now()
totalTime = end - start
print(f"total execution time: {totalTime}")
```
**Answer 0 (score: 1)**
Raj, here is the GraphFrame answer as requested.
I believe this can be done using GraphFrames. I have not found a simple way to retrieve all descendants, so I am providing two solutions.
```python
from graphframes import GraphFrame
from pyspark.sql.functions import col

# initial dataframe
edgesDf = spark.createDataFrame([
    (248278, 264543),
    (251713, 252689),
    (252689, 248278),
    (258977, 251713),
    (264543, 0)
    ],
    ["dst", "src"]
)

# get all ids as vertices
verticesDf = edgesDf.select(col("dst").alias("id")).union(edgesDf.select("src")).distinct()

# create graphFrame
graphGf = GraphFrame(verticesDf, edgesDf)

# for performance
sc.setCheckpointDir("/tmp/checkpoints")
graphGf.cache()

#### Motif approach
#
# note that this requires knowing the depth of the tree
fullPathDf = graphGf.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d); (d)-[de]->(e); (e)-[ef]->(f)")

# pivot the per-path edge columns into a single edge column
edgeDf = fullPathDf.select(col("ab").alias("edge")).union(fullPathDf.select("bc")).union(fullPathDf.select("cd")).union(fullPathDf.select("de")).union(fullPathDf.select("ef"))

# result
edgeDf.select("edge.dst", "edge.src").show()

#### Breadth First Search approach
#
# does not require knowing the depth, but does require knowing the id of the leaf node
pathDf = graphGf.bfs("id = 0", "id = 258977", maxPathLength=5)

# pivot the per-path edge columns into a single edge column
edgeDf = pathDf.select(col("e0").alias("edge")).union(pathDf.select("e1")).union(pathDf.select("e2")).union(pathDf.select("e3")).union(pathDf.select("e4"))

# result
edgeDf.select("edge.dst", "edge.src").show()
```
**Answer 1 (score: 0)**
I suggest adding a dataframe checkpoint() to your code. Checkpointing prevents the dataframe lineage from growing so long that it causes performance problems. Your code appears to create many dataframes, and it is not clear to me why, so I am not sure which dataframe would benefit most from checkpointing; add the checkpoint to the dataframe that you modify in each iteration. Here is a good PySpark explanation of checkpointing.
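For illustration, a minimal sketch of where such a checkpoint could go, reusing the loop and variable names from the question (those names are assumptions here, not part of this answer):

```python
# assumes a SparkSession `spark` and the question's get_child/df_o/df_o2 loop
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

while df_child_exist:
    cnt += 1
    df_o2 = get_child(dst, df, "*").withColumn("order_id", lit(cnt))
    df_child_exist = len(df_o2.take(1)) >= 1
    if df_child_exist:
        dst = df_o2.selectExpr("childid").first()[0]
        # checkpoint the accumulating union so its lineage is truncated
        # instead of growing by one plan per iteration
        df_o = df_o.union(df_o2).checkpoint()
```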