Overwrite specific partitions in Spark DataFrame write method

Asked: 2016-07-20 18:00:37

Tags: apache-spark apache-spark-sql spark-dataframe

I want to overwrite specific partitions in Spark, not all of them. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is a DataFrame with the incremental data to be overwritten.

hdfs-base-path contains the master data.

When I try the above command, it deletes all the existing partitions and inserts only those partitions present in df at the hdfs path.

My requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone help me with this?

13 Answers:

Answer 0 (score: 56)

Finally! This is now a feature in Spark 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236

To use it, you need to set spark.sql.sources.partitionOverwriteMode to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
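
As a rough PySpark sketch (the table name partitioned_table comes from the snippet above; the partition column name part_col is a placeholder):

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Repartition on the partition column so each partition directory gets one (or a few)
# larger files instead of one small file per upstream task
df.repartition("part_col") \
  .write \
  .mode("overwrite") \
  .insertInto("partitioned_table")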

Before Spark 2.3.0, the best solution was to run SQL statements to drop those partitions first and then write them with mode append.
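
A minimal pre-2.3.0 sketch of that approach (table and column names are placeholders, assuming a Hive table partitioned by part_col):

# Drop the partition that is about to be rewritten, then append the fresh data
spark.sql("ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (part_col='2016-07-20')")
df.where(df.part_col == '2016-07-20').write.mode("append").insertInto("partitioned_table")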

Answer 1 (score: 40)

This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.:

/root/path/to/data/partition_col=value

If you are using Spark prior to 2.0, you will also need to stop Spark from emitting metadata files, because they break automatic partition discovery.

If you are using Spark prior to 1.6.2, you will additionally need to delete the _SUCCESS file in /root/path/to/data/partition_col=value, or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
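
A rough PySpark sketch of writing one partition's data directly into its directory (the filter, the dropped partition column, and the Parquet format are illustrative assumptions, not part of the original answer):

# Keep only the rows for the target partition and drop the partition column,
# since its value is encoded in the directory name rather than stored in the files
single_partition = df.filter(df.partition_col == "value").drop("partition_col")
single_partition.write.mode("overwrite").parquet("/root/path/to/data/partition_col=value")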

You can get a few more details about how to manage large partitioned tables from the Spark Summit talk on Bulletproof Jobs.

Answer 2 (score: 6)

Using Spark 1.6...

The HiveContext can greatly simplify this process. The key is that you must first create the table in Hive using a CREATE EXTERNAL TABLE statement with the partitioning defined. For example:

# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'

From here, let's say you have a Dataframe with new records for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE with this Dataframe, which will overwrite the table only for the partitions contained in the Dataframe:

# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')

hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
                   SELECT name, age
                   FROM update_dataframe""")

Note: update_dataframe in this example has a schema that matches that of the target test table.

One easy mistake with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and create the table using the Dataframe API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE... PARTITION function.

Hope this helps.

Answer 3 (score: 2)

I tried the approach below to overwrite a specific partition in a HIVE table.

### Load data and check records
    from pyspark.sql.functions import col, lit
    raw_df = spark.table("test.original")
    raw_df.count()

Let's say this table is partitioned on the column **c_birth_year**, and we would like to update the partitions for years before 1925.


### Check data in few partitions.
    sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
    print "Number of records: ", sample.count()
    sample.show()


### Back-up the partitions before deletion
    raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")


### UDF : To delete particular partition.
    def delete_part(table, part):
        qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
        spark.sql(qry)


### Delete partitions
    part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
    part_list = part_df.rdd.map(lambda x : x[0]).collect()

    table = "test.original"
    for p in part_list:
        delete_part(table, p)


### Do the required Changes to the columns in partitions
    df = spark.table("test.original_bkp")
    newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
    newdf.select("c_customer_sk", "c_preferred_cust_flag").show()


### Write the Partitions back to Original table
    newdf.write.insertInto("test.original")


### Verify data in Original table
    spark.table("test.original").filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()



Hope it helps.

Regards,

Neeraj

Answer 4 (score: 1)

If you are working with a DataFrame, you may want to use a Hive table over the data. In that case, you only need to call the method

df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)

It will overwrite the partitions that the DataFrame contains.

There is no need to specify the format (orc), because Spark will use the Hive table format.

It works fine in Spark version 1.6.

Answer 5 (score: 1)

Instead of writing to the target table directly, I would suggest you create a temporary table like the target table and insert your data there:

CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation>';

Once the table is created, you write your data to tmpLocation:

df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)

Then you recover the table partition paths by executing:

MSCK REPAIR TABLE tmpTbl;

Get the partition paths by querying the Hive metadata, for example:

SHOW PARTITIONS tmpTbl;

Delete those partitions from trgtTbl and move the directories from tmpTbl to trgtTbl.
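
A rough PySpark sketch of that last step (the partition column p_col, the partition value, and the target table location are placeholders, not from the original answer):

import subprocess

# Remove the stale partition from the target table
spark.sql("ALTER TABLE trgtTbl DROP IF EXISTS PARTITION (p_col='some_value')")

# Move the rewritten partition directory from the temporary location into the target table's location
subprocess.check_call([
    "hdfs", "dfs", "-mv",
    "<tmpLocation>/p_col=some_value",
    "<trgtTblLocation>/p_col=some_value",
])

# Let the target table pick up the moved directory
spark.sql("MSCK REPAIR TABLE trgtTbl")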

Answer 6 (score: 1)

As Jatin wrote, you can delete the partitions from Hive and from the path, and then append the data. Since I wasted too much time on it, I added the following example for other Spark users. I used Scala with Spark 2.2.1:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}

  case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)

 object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"

val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"

val someText = List("text1", "text2")
val ids = (0 until 5).toList

val listData = partitions1.flatMap(p1 => {
  partitions2.flatMap(p2 => {
    someText.flatMap(
      text => {
        ids.map(
          id => DataExample(p1, p2, text, id)
        )
      }
    )
  }
  )
})

val asDataFrame = spark.createDataFrame(listData)

//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
  val p = new Path(path)
  val fs = p.getFileSystem(new Configuration())
  fs.delete(p, recursive)
}

def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
  if (spark.catalog.tableExists(tableName)) {
    //clean partitions
    val asColumns = partitions.map(c => new Column(c))
    val relevantPartitions = df.select(asColumns: _*).distinct().collect()
    val partitionToRemove = relevantPartitions.map(row => {
      val fields = row.schema.fields
      s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
        s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
    })

    val cleanFolders = relevantPartitions.map(partition => {
      val fields = partition.schema.fields
      path + "/" + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
    })

    println(s"Going to clean ${partitionToRemove.size} partitions")
    partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
    cleanFolders.foreach(partition => deletePath(partition, true))
  }
  df.write
    .options(Map("path" -> path))
    .mode(SaveMode.Append)
    .partitionBy(partitionColumns: _*)
    .saveAsTable(tableName)
}

//Now test
tableOverwrite(asDataFrame, partitionColumns, myTablePath)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, myTablePath)

import spark.implicits._

val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
  println("Overwrite is working !!!")
}

}

Answer 7 (score: 1)

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")

This worked for me in AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2).

Answer 8 (score: 0)

You can do something like this to make the job re-entrant (idempotent) — tried this on Spark 2.2:

# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)

# delete directory
dbutils.fs.rm(<partition_directory>, recurse=True)

# Load the partition
df.write\
  .partitionBy("partition_col")\
  .saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)

Answer 9 (score: 0)

I suggest you do a clean-up first and then write the new partitions with Append mode:

import scala.sys.process._
def deletePath(path: String): Unit = {
    s"hdfs dfs -rm -r -skipTrash $path".!
}

df.select(partitionColumn).distinct.collect().foreach(p => {
    val partition = p.getAs[String](partitionColumn)
    deletePath(s"$path/$partitionColumn=$partition")
})

df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)

This deletes only the new partitions (those present in df). After writing the data, run this command if you need to update the metastore:

sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")

Note: deletePath assumes that the hdfs command is available on your system.

Answer 10 (score: 0)

Tested on Spark 2.3.1 with Scala. Most of the answers above write to a Hive table. However, I wanted to write directly to disk, with an external hive table defined on top of that folder.

First, the required configuration:

val sparkSession: SparkSession = SparkSession
      .builder
      .enableHiveSupport()
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
      .appName("spark_write_to_dynamic_partition_folders")

Then use it as follows:

DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")

Answer 11 (score: 0)

Adding the 'overwrite=True' parameter in the insertInto statement solves this:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)

The default is overwrite=False. Changing it to True allows us to overwrite only the specific partitions contained in both df and partitioned_table. This helps us avoid overwriting the entire contents of partitioned_table with df.

Answer 12 (score: 0)

For Spark >= 2.3.0:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)