Question

I am using Spark 2.3 and trying to read a Hive table in Spark as follows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")

Here I add the current system date to the existing DataFrame as a new column:

import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())

When I try to write this DataFrame back as a Hive table, I run into a problem:

newdf.write.mode("overwrite").saveAsTable("emp.emptable")

pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'

So I checkpoint the DataFrame to break the lineage, since I am reading from and writing to the same table:

checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")

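This way it works fine and the new column is added to the Hive table. But I have to delete the checkpoint files every time they are created. Is there a better way to break the lineage, write the same DataFrame with the updated column details, and save it to an HDFS location or as a Hive table?

Or is there a way to specify a temporary location for the checkpoint directory that gets deleted once the Spark session completes?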

Answer 1

As we discussed in this post, setting the property below is the way to go.

spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

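That question had a different context. We wanted to keep the checkpointed dataset, so we did not care about adding a cleanup solution.

Setting the above property works sometimes (tested in Scala, Java and Python), but it is hard to rely on. The official documentation describes it as "Controls whether to clean checkpoint files if the reference is out of scope." I do not know exactly what that means, because my understanding is that once the Spark session/context is stopped, it should clean up. It would be great if someone could shed some light on this.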
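One possible reason the property is hard to rely on: as far as I can tell, it is a core Spark setting read from the context's configuration, and the cleanup runs when the driver's garbage collector notices that the checkpointed RDD is no longer referenced, not when the session stops. A minimal sketch, assuming the property is better supplied before the session is created (the helper name with_checkpoint_cleanup is hypothetical):

```python
def with_checkpoint_cleanup(builder):
    """Return the given SparkSession builder with checkpoint cleanup enabled.

    Assumption: the setting is only picked up reliably when it is part of the
    configuration the SparkContext starts with, so it is applied on the
    builder rather than on an already running session.
    """
    return builder.config(
        "spark.cleaner.referenceTracking.cleanCheckpoints", "true"
    )
```

It would be used as with_checkpoint_cleanup(SparkSession.builder.enableHiveSupport()).getOrCreate(). Even then, files may remain if the driver exits before a garbage collection happens, so this is not a guarantee.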

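Regarding

Is there a best way to break the lineage

Check this question, where @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I have not tested it myself.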
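The createDataFrame(RDD, Schema) idea can be sketched as below. This is only an illustration of the call and is untested against the overwrite case from the question; the helper name cut_lineage is hypothetical. Unlike checkpoint(), nothing is written out here, so the rows are still read lazily from the source table, and it may not be safe when overwriting that same table.

```python
def cut_lineage(spark, df):
    """Rebuild df from its underlying RDD and schema.

    The returned DataFrame has a fresh logical plan that no longer refers to
    the original table relation. The data itself is NOT materialized.
    """
    return spark.createDataFrame(df.rdd, df.schema)
```

For example, df = cut_lineage(spark, spark.table("emp.emptable")) would replace the checkpoint() call from the question.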

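For what it's worth, I usually do not rely on the above property and instead delete the checkpointed directory in code to be safe.

We can get the checkpointed directory as follows: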

Scala:

// Set the checkpoint directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")

scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3

// It returns a String, so we can use org.apache.hadoop.fs to delete the path

PySpark:

# Set the checkpoint directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'

# Notice the 'u' at the start: it returns a unicode object, so use str(t)
# Below are the steps to get the Hadoop FileSystem object and delete the path

>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True

# The second argument requests a recursive delete
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)), True)
True

>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False
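The manual steps above could be wrapped so the directory is removed even when the job fails. This is a minimal sketch, not a tested utility: the name temporary_checkpoint_dir is hypothetical, and it relies on the same internal _jsc and _jvm handles used in the session above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_checkpoint_dir(spark, base_dir):
    """Set a checkpoint directory and delete it when the block exits."""
    sc = spark.sparkContext
    sc.setCheckpointDir(base_dir)
    # Spark creates a unique subdirectory under base_dir; ask the JVM for it.
    checkpoint_path = str(sc._jsc.sc().getCheckpointDir().get())
    try:
        yield checkpoint_path
    finally:
        hadoop_path = sc._jvm.org.apache.hadoop.fs.Path(checkpoint_path)
        # Resolve the file system from the path itself (HDFS, local, ...).
        fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())
        fs.delete(hadoop_path, True)  # True = recursive
```

The read, checkpoint() and saveAsTable() calls from the question would then go inside a block such as with temporary_checkpoint_dir(spark, "hdfs:///tmp/checkpoint"): so that the write finishes before the checkpoint files are deleted.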
