I have a dataframe with the following columns:
category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
--------------------------------------------------------------------------------------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677
The dataframe starts with 0 rows, and each function in my script appends one row to it.
One function needs to modify 1 or 2 cell values based on a condition. How can I do that?
Code:
schema = StructType([
    StructField("category", StringType()),
    StructField("category_id", StringType()),
    StructField("bucket", StringType()),
    StructField("prop_count", StringType()),
    StructField("event_count", StringType()),
    StructField("accum_prop_count", StringType())
])
a_df = sqlContext.createDataFrame([],schema)
a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)
A row added by one of the other functions:
a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)
Now, to modify a row, I am trying a join with conditions:
a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")
But this code does not work as intended. Instead of an updated row, I get this:
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
How do I fix this? The correct output should have 2 rows, with the second row holding the updated values.
Answer (score: 2):
1). An inner join drops non-matching rows from the initial dataframe; if you want to keep the same number of rows as a_df (the left side), you need a left join.
2). An == join condition duplicates columns that share the same name; you can pass a list of column names instead.
3). I assume that by "some other cond here" you mean bucket.
4). You want to keep the value from a_temp4 whenever it exists (for non-matching rows the join sets it to null); psf.coalesce lets you do exactly that:
import pyspark.sql.functions as psf
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
psf.coalesce(a_temp4.category, a_df.category).alias("category"),
"category_id",
"bucket",
psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"),
psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"),
psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
)
+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
| state| state| 2| 444| 555| 666|
| nation| nation| 1| 222| 444| 555|
+--------+-----------+------+----------+-----------+----------------+
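To make the left-join-then-coalesce semantics concrete, here is a minimal plain-Python sketch (dicts standing in for rows; `left_join_coalesce` is a hypothetical helper written for illustration, not Spark code):

```python
# Plain-Python sketch of the left-join + coalesce logic above.
# Rows are dicts; this only illustrates the semantics, it is not Spark code.

def left_join_coalesce(left_rows, right_rows, keys, value_cols):
    """For each left row, look for a right row matching on `keys`.
    Where a match exists, the right-hand (update) values override the
    left-hand ones, mimicking psf.coalesce(right.col, left.col)."""
    out = []
    for left in left_rows:
        match = next(
            (r for r in right_rows if all(r[k] == left[k] for k in keys)),
            None,
        )
        row = dict(left)
        if match is not None:
            for col in value_cols:
                # coalesce: prefer the update value when it is present
                if match.get(col) is not None:
                    row[col] = match[col]
        out.append(row)
    return out

a_rows = [
    {"category": "nation", "category_id": "nation", "bucket": 1,
     "prop_count": 222, "event_count": 444, "accum_prop_count": 555},
    {"category": "nation", "category_id": "state", "bucket": 2,
     "prop_count": 222, "event_count": 444, "accum_prop_count": 555},
]
updates = [
    {"category": "state", "category_id": "state", "bucket": 2,
     "prop_count": 444, "event_count": 555, "accum_prop_count": 666},
]

result = left_join_coalesce(
    a_rows, updates, keys=["category_id", "bucket"],
    value_cols=["category", "prop_count", "event_count", "accum_prop_count"],
)
for row in result:
    print(row)
```

The first row has no match on (category_id, bucket), so it passes through untouched; the second row is overridden by the update values, which is exactly the two-row result shown above (Spark may return the rows in a different order).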
If you only ever work with single-row dataframes, consider writing the update directly instead of using a join:
def update_col(category_id, bucket, col_name, col_val):
    return psf.when(
        (a_df.category_id == category_id) & (a_df.bucket == bucket), col_val
    ).otherwise(a_df[col_name]).alias(col_name)
a_df.select(
    update_col("state", 2, "category", "state"),
"category_id",
"bucket",
update_col("state", 2, "prop_count", 444),
update_col("state", 2, "event_count", 555),
update_col("state", 2, "accum_prop_count", 666)
).show()
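The when/otherwise pattern amounts to a per-row conditional assignment; a minimal plain-Python equivalent (again with dict rows, and an `update_col` helper written purely for illustration) looks like this:

```python
# Plain-Python equivalent of psf.when(cond, new_val).otherwise(old_col):
# rewrite one column, but only for rows matching the condition.

def update_col(rows, category_id, bucket, col_name, col_val):
    return [
        {**row, col_name: col_val}
        if row["category_id"] == category_id and row["bucket"] == bucket
        else row
        for row in rows
    ]

rows = [
    {"category": "nation", "category_id": "nation", "bucket": 1, "prop_count": 222},
    {"category": "nation", "category_id": "state", "bucket": 2, "prop_count": 222},
]

# Apply the same condition per column, as in the select() above.
rows = update_col(rows, "state", 2, "category", "state")
rows = update_col(rows, "state", 2, "prop_count", 444)
print(rows)
```

As with the Spark version, every row is emitted; only the row matching (category_id, bucket) gets the new values, and all other rows keep their original ones.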