Question

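When using pyspark, is there a way to mix static and dynamic partitioning? What I want to do is partition dynamically, and then set the finest-grained partition level as static. The static partition would be an identifier for the ETL job, which means it has to be a static partition unless I add it beforehand as a column to every record in the DataFrame.

What I have right now is the following. It may not be optimal, so suggestions are welcome.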

dataFrame.repartition('year','month','day','hour').write.partitionBy('year','month','day','hour').mode('append').parquet(args['s3_dest'])

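Somehow I would like to introduce an additional static partition, so that records show up in S3 as: s3://bucket/year=xxxx/month=xx/day=xx/hour=xx/executionId=xx

The executionId is generated in the ETL script.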

Answer 1

Just add it as a new column using lit, and then append it to the partitionBy list:

from pyspark.sql.functions import lit

executionId = ...

(dataFrame
    .withColumn('executionId', lit(executionId))
    .repartition('year', 'month', 'day', 'hour')  # No executionId here!
    .write
    .partitionBy('year', 'month', 'day', 'hour', 'executionId')
    .mode('append')
    .parquet(args['s3_dest']))
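To make the resulting layout concrete, here is a minimal sketch in plain Python (no Spark required) of how an executionId might be generated once per run, and which Hive-style directory a record would land in after the write. The helper names new_execution_id and partition_path, and the choice of uuid4 as the identifier, are assumptions for illustration and not part of the original script.

```python
import uuid
from datetime import datetime


def new_execution_id() -> str:
    """Return a unique identifier for one ETL run (hypothetical helper)."""
    return uuid.uuid4().hex


def partition_path(base: str, ts: datetime, execution_id: str) -> str:
    """Build the key=value directory path for one record.

    Mirrors the column order passed to partitionBy:
    year, month, day, hour, executionId.
    """
    return (
        f"{base}/year={ts.year}/month={ts.month}/day={ts.day}"
        f"/hour={ts.hour}/executionId={execution_id}"
    )


if __name__ == "__main__":
    run_id = new_execution_id()  # generated once per ETL run
    print(partition_path("s3://bucket", datetime(2018, 3, 7, 14), run_id))
```

Because the executionId column holds the same literal value for every row in a run, each hour directory gains exactly one new executionId subdirectory per run, which is why it can be left out of repartition.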

