I want to overwrite specific partitions in Spark instead of all of them. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is a DataFrame holding the incremental data to be overwritten,
and hdfs-base-path contains the master data.
When I run the command above, it deletes all existing partitions and inserts only the partitions present in df at the HDFS path.
What I need is to overwrite only those partitions of the given HDFS path that are present in df. Can someone help me with this?
Answer 0 (score: 56)
Finally! This is now a feature in Spark 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236
To use it, you need to set spark.sql.sources.partitionOverwriteMode to dynamic, the dataset must be partitioned, and the write mode must be overwrite. For example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you don't end up with 400 files in every folder; see the sketch below.
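As a small sketch of that recommendation (col4 is the partition column from the question and partitioned_table is the placeholder table name used above):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# Repartitioning by the partition column sends all rows of one partition value to a single
# task, so each output folder receives one file instead of hundreds
df.repartition("col4").write.mode("overwrite").insertInto("partitioned_table")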
Before Spark 2.3.0, the best solution was to launch SQL statements deleting those partitions and then write them with mode append.
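A hedged sketch of that pre-2.3.0 pattern in PySpark (again assuming col4 and partitioned_table from above):
# Drop every partition present in the incremental DataFrame, then append the new data
for row in df.select("col4").distinct().collect():
    spark.sql(
        "ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (col4='{}')".format(row["col4"])
    )
df.write.mode("append").insertInto("partitioned_table")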
Answer 1 (score: 40)
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.:
/root/path/to/data/partition_col=value
If you are using Spark prior to 2.0, you also need to stop Spark from emitting metadata files, because they break automatic partition discovery.
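As a hedged sketch (not the answer's original snippet), writing one partition's data straight into its directory could look like this in PySpark; col4 and the maprfs base path come from the question, and the partition value some_value is hypothetical:
# Overwrite one partition by writing straight into its directory.
# The partition column is dropped because the directory name already encodes its value.
(df.filter(df.col4 == "some_value")
   .drop("col4")
   .write.mode("overwrite")
   .orc("maprfs:///hdfs-base-path/col4=some_value"))

# For Parquet output on Spark < 2.0, also disable the summary metadata files
# that break automatic partition discovery:
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")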
If you are using Spark prior to 1.6.2, you also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value, or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
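One way to remove that marker from PySpark, sketched with the Hadoop FileSystem API through py4j (sc is assumed to be the active SparkContext; the path is the example directory above):
# Pre-1.6.2 only: delete the _SUCCESS marker so automatic partition discovery keeps working
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(jvm.org.apache.hadoop.fs.Path("/root/path/to/data/partition_col=value/_SUCCESS"), False)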
You can get a few more details on how to manage large partitioned tables from the Spark Summit talk on Bulletproof Jobs.
Answer 2 (score: 6)
Using Spark 1.6...
The HiveContext can greatly simplify this process. The key is that you must first create the table in Hive using a CREATE EXTERNAL TABLE statement with the partitioning defined. For example:
# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'
From there, let's say you have a DataFrame with new records for a specific partition (or several partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE with this DataFrame, which will overwrite the table only for the partitions contained in the DataFrame:
# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')
hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
SELECT name, age
FROM update_dataframe""")
Note: update_dataframe in this example has a schema that matches that of the target test table.
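If your DataFrame's column order does not already match the table, a simple select before registering the temp table aligns it (a sketch; the column names come from the example DDL above):
# Data columns first, partition column (age) last, which is what the
# dynamic-partition INSERT OVERWRITE above expects
update_dataframe = update_dataframe.select("name", "age")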
An easy mistake with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and create the table with the DataFrame API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE ... PARTITION function.
Hope this helps.
Answer 3 (score: 2)
I tried the approach below to overwrite a specific partition in a HIVE table.
### load Data and check records
from pyspark.sql.functions import col, lit
raw_df = spark.table("test.original")
raw_df.count()
Let's say this table is partitioned on the column **c_birth_year**, and we would like to update the partitions for years up to 1925.
### Check data in few partitions.
sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
print "Number of records: ", sample.count()
sample.show()
### Back-up the partitions before deletion
raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")
### UDF : To delete particular partition.
def delete_part(table, part):
qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
spark.sql(qry)
### Delete partitions
part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
part_list = part_df.rdd.map(lambda x : x[0]).collect()
table = "test.original"
for p in part_list:
delete_part(table, p)
### Do the required Changes to the columns in partitions
df = spark.table("test.original_bkp")
newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
newdf.select("c_customer_sk", "c_preferred_cust_flag").show()
### Write the Partitions back to Original table
newdf.write.insertInto("test.original")
### Verify data in Original table
spark.table("test.original").filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()
Hope it helps.
Regards,
Neeraj
Answer 4 (score: 1)
If you use a DataFrame, you may want to use a Hive table over the data. In that case, you just need to call the method
df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)
It will overwrite the partitions contained in the DataFrame.
There is no need to specify the format (orc), because Spark will use the Hive table format.
It works well in Spark version 1.6.
Answer 5 (score: 1)
I suggest you create a temporary table similar to the target table and insert the data there, instead of writing directly into the target table.
CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation>';
Once the table is created, you write your data to tmpLocation:
df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)
Then you recover the table partition paths by executing:
MSCK REPAIR TABLE tmpTbl;
Get the partition paths by querying the Hive metadata, e.g.:
SHOW PARTITIONS tmpTbl;
Delete those partitions from trgtTbl and move the directories from tmpTbl to trgtTbl, as sketched below.
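A hedged PySpark sketch of that final drop-and-move step (the table names, the partition column p_col, and the location placeholders are the ones used in this answer; adjust them to your environment):
import subprocess

# Partition specs that exist in the temporary table, e.g. 'p_col=2019'
partitions = [r[0] for r in spark.sql("SHOW PARTITIONS tmpTbl").collect()]

for part in partitions:
    col_name, val = part.split("=", 1)
    # Drop the partition from the target table
    spark.sql("ALTER TABLE trgtTbl DROP IF EXISTS PARTITION ({}='{}')".format(col_name, val))
    # Move the freshly written directory under the target table's location
    subprocess.check_call(["hdfs", "dfs", "-mv",
                           "<tmpLocation>/" + part,
                           "<trgtTblLocation>/" + part])

# Let the target table pick up the moved directories
spark.sql("MSCK REPAIR TABLE trgtTbl")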
Answer 6 (score: 1)
As "Jatin" wrote, you can delete the partitions from Hive and from the path and then append the data. Since I wasted too much time on it, I added the following example for other Spark users. I used Scala with Spark 2.2.1:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)
object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"
val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"
val someText = List("text1", "text2")
val ids = (0 until 5).toList
val listData = partitions1.flatMap(p1 => {
partitions2.flatMap(p2 => {
someText.flatMap(
text => {
ids.map(
id => DataExample(p1, p2, text, id)
)
}
)
}
)
})
val asDataFrame = spark.createDataFrame(listData)
//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
val p = new Path(path)
val fs = p.getFileSystem(new Configuration())
fs.delete(p, recursive)
}
def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
if (spark.catalog.tableExists(tableName)) {
//clean partitions
val asColumns = partitions.map(c => new Column(c))
val relevantPartitions = df.select(asColumns: _*).distinct().collect()
val partitionToRemove = relevantPartitions.map(row => {
val fields = row.schema.fields
s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
})
val cleanFolders = relevantPartitions.map(partition => {
val fields = partition.schema.fields
path + "/" + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
})
println(s"Going to clean ${partitionToRemove.size} partitions")
partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
cleanFolders.foreach(partition => deletePath(partition, true))
}
asDataFrame.write
.options(Map("path" -> myTablePath))
.mode(SaveMode.Append)
.partitionBy(partitionColumns: _*)
.saveAsTable(tableName)
}
//Now test
tableOverwrite(asDataFrame, partitionColumns, myTablePath)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, myTablePath)
import spark.implicits._
val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
println("Overwrite is working !!!")
}
}
Answer 7 (score: 1)
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
This works for me in an AWS Glue ETL job (Glue 1.0 - Spark 2.4 - Python 2).
Answer 8 (score: 0)
You can do something like this to make the job reentrant (idempotent); tried this on Spark 2.2:
# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)
# delete directory
dbutils.fs.rm(<partition_directory>, recurse=True)
# Load the partition
df.write\
.partitionBy("partition_col")\
.saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
Answer 9 (score: 0)
I would suggest you do a cleanup and then write the new partitions with Append mode:
import scala.sys.process._
def deletePath(path: String): Unit = {
s"hdfs dfs -rm -r -skipTrash $path".!
}
df.select(partitionColumn).distinct.collect().foreach(p => {
val partition = p.getAs[String](partitionColumn)
deletePath(s"$path/$partitionColumn=$partition")
})
df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)
This deletes only the new partitions. After writing the data, run this command if you need to update the metastore:
sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")
Note: deletePath assumes that the hdfs command is available on your system.
Answer 10 (score: 0)
Tested on Spark 2.3.1 with Scala.
Most of the answers above write to a Hive table. However, I wanted to write directly to disk, with an external hive table defined on top of that folder.
First, the required configuration:
val sparkSession: SparkSession = SparkSession
.builder
.enableHiveSupport()
.config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
.appName("spark_write_to_dynamic_partition_folders")
.getOrCreate()
Then use it like this:
DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")
Answer 11 (score: 0)
Adding the 'overwrite=True' parameter to the insertInto statement solves this:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)
The default is overwrite=False. Changing it to True allows us to overwrite only the specific partitions contained both in df and in partioned_table. This helps us avoid overwriting the entire contents of partioned_table with df.
Answer 12 (score: 0)
For Spark >= 2.3.0:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)