I have a table in Postgres named "mytable" with two columns: id (bigint) and value (varchar(255)). The id column gets its value from a sequence via nextval('my_sequence').
A PySpark application takes a DataFrame and inserts it into "mytable" using the Postgres JDBC jar (postgresql-42.1.4.jar). I am creating the id column with df.withColumn('id', lit("nextval('my_sequence')")), but Postgres interprets that column as "character varying".
I can see there are ways to call a Postgres function when reading data (How to remotely execute a Postgres SQL function on Postgres using PySpark JDBC connector?), but I don't know how to call a Postgres function like nextval() when writing data to Postgres.
Here is how I currently write data from PySpark to Postgres:
df.write.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", 'mytable') \
.mode('append') \
.save()
How can I write to a Postgres table with PySpark when a column needs its value from a sequence via nextval()?
Answer 0 (score: 1)
TL;DR Unless you create your own JdbcDialect and override the insert logic, there is no way to execute database code on insert, and I don't think that is something you want to do for such a small feature.

Personally, I would use a trigger and leave the remaining work to the database server:
CREATE FUNCTION set_id() RETURNS trigger AS $set_id$
BEGIN
    IF NEW.id IS NULL THEN
        NEW.id = nextval('my_sequence');
    END IF;
    RETURN NEW;
END;
$set_id$ LANGUAGE plpgsql;

CREATE TRIGGER set_id BEFORE INSERT ON mytable
    FOR EACH ROW EXECUTE PROCEDURE set_id();
With the trigger in place, you can write the id column as NULL and let the database fill it in (note that in PySpark the literal is None, not null):

df.select(lit(None).cast("bigint").alias("id"), col("value")).write
...

You could also generate the ids in Spark itself (Primary keys with Apache Spark) and shift the values by the largest id already in the database, but that approach can be brittle.
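Stripped of Spark specifics, the shift approach boils down to offsetting locally generated row numbers past the database's current maximum id. A minimal sketch of the idea, with no Spark dependency (the function name and shapes here are illustrative, not part of any API):

```python
def shift_ids(values, current_max_id):
    """Assign ids by offsetting a local row counter past the DB's max id.

    values: list of column values to insert
    current_max_id: e.g. the result of SELECT max(id) FROM mytable
    Returns (id, value) pairs. Brittle: a concurrent writer can claim
    the same range between reading max(id) and writing the rows.
    """
    return [(current_max_id + i + 1, v) for i, v in enumerate(values)]
```

In Spark, the per-row counter would typically come from monotonically_increasing_id() (as discussed in the linked question), with the max(id) offset added as a literal; the brittleness is the race between reading max(id) and completing the write.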