Reshape functionality in PySpark (transposing columns)

Time: 2021-04-06 07:58:51

Tags: python r apache-spark pyspark apache-spark-sql

I am converting an R script to PySpark and am stuck at one point. Here is the relevant code from the R script:

## Transposing the stacked trt section to wider set for Outcome ID set
## c("Trial.ID", "Arm.ID", "Re.randomized.arm.id", "Phase.ID", "Period.ID")

TRT <- select(trt_stacked, -Planned.Treatment.ID) %>% 
  renameCol(c("Treatment.Administration.days.x", "Treatment.Administration.days.y"),
            c("Treatment.Administration.days.plan", "Treatment.Administration.days.act") )  %>% 
  reshape(direction = "wide", 
          idvar = c("Trial.ID", "Arm.ID", "Re.randomized.arm.id", "Phase.ID", "Period.ID",
                    "Phase", "Phase.Duration", "Phase.Duration.unit", "Phase.Description",
                    "Period", "Period.Duration", "Period.Dur.Unit", "Period.Description"), 
          timevar="Treatment.ID") 

I need to convert this code to PySpark. Spark has a pivot function that can transpose data, but I do not understand exactly what this reshape function does.

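As far as I understand, reshape takes every column that is not listed in idvar and spreads it into one set of output columns per distinct Treatment.ID value, so that rows become columns. It also appends the Treatment.ID value to the name of each transposed column, like this: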

Titration.1
Titration.Duration.1
Titration.Duration.Unit.1
Titration.Target.1
Titration.Value.1
Titration.unit.1
Treatment.name.1
Treatment.Class.1
Treatment.Description.1
Treatment.Start.Time.1
Treatment.End.Time.1

Titration.2
Titration.Duration.2
Titration.Duration.Unit.2
Titration.Target.2
Titration.Value.2
Titration.unit.2
Treatment.name.2
Treatment.Class.2
Treatment.Description.2
Treatment.Start.Time.2
Treatment.End.Time.2
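
The behaviour described above can be sketched in plain Python. This is only an approximation of base R's `reshape(direction = "wide")` as I understand it, not its actual implementation. The function name `reshape_wide_rows` and the trimmed sample rows are made up for illustration. The sketch assumes that missing id/treatment combinations are filled with `None` (R's `NA`), that existing `NA` values are kept rather than dropped, and that when several rows share the same id and treatment only the first is used.

```python
def reshape_wide_rows(rows, idvar, timevar, sep="."):
    """Rough sketch of R's reshape(direction="wide") over a list of dicts."""
    # Distinct timevar values, in order of first appearance.
    times = list(dict.fromkeys(row[timevar] for row in rows))
    # Every column that is neither an id column nor the timevar gets transposed.
    varying = [c for c in rows[0] if c not in idvar and c != timevar]

    wide = {}       # id tuple -> output row
    filled = set()  # (id tuple, time) pairs already copied
    for row in rows:
        key = tuple(row[c] for c in idvar)
        if key not in wide:
            out = {c: row[c] for c in idvar}
            # Pre-fill every transposed column with None (NA in R).
            out.update({f"{v}{sep}{t}": None for t in times for v in varying})
            wide[key] = out
        if (key, row[timevar]) in filled:
            continue  # assumption: only the first matching row is taken
        filled.add((key, row[timevar]))
        for v in varying:
            wide[key][f"{v}{sep}{row[timevar]}"] = row[v]
    return list(wide.values())


# Trimmed from the question's input; the last row is made up to show
# an id group that has no row for treatment 2.
rows = [
    {"treatment_id": 1, "arm_id": 1, "trial_id": 16, "titration": "titration", "titration_value": None},
    {"treatment_id": 2, "arm_id": 1, "trial_id": 16, "titration": "titration", "titration_value": None},
    {"treatment_id": 2, "arm_id": 1, "trial_id": 16, "titration": "No titration", "titration_value": None},
    {"treatment_id": 1, "arm_id": 2, "trial_id": 16, "titration": "titration", "titration_value": None},
]
wide = reshape_wide_rows(rows, idvar=["arm_id", "trial_id"], timevar="treatment_id", sep="_")
for row in wide:
    print(row)
```

Under these assumptions the second `treatment_id = 2` row ("No titration") never reaches the output, because the id columns plus the treatment id do not identify a row uniquely. If both values are needed, the id columns would have to include something that tells the two rows apart.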

Does the reshape function in R also drop null values? Is there a similar function in Spark or Python?

Input:

|treatment_id|arm_id|re_randomized_arm_id|trial_id|phase_id|phase|phase_duration|phase_duration_unit|phase_description|titration|titration_duration|titration_duration_unit|titration_target|titration_value|titration_unit|
|1|1|-999|16|1|Active|NA|NA|NA|titration|NA|NA|NA|NA|NA|
|2|1|-999|16|1|Active|NA|NA|NA|titration|NA|NA|NA|NA|NA|
|2|1|-999|16|1|Active|NA|NA|NA|No titration|NA|NA|NA|NA|NA|

Expected output after transposing:

|treatment_id|arm_id|re_randomized_arm_id|trial_id|phase_id|phase|phase_duration|phase_duration_unit|phase_description|titration_1|titration_duration_1|titration_duration_unit_1|titration_target_1|titration_value_1|titration_unit_1|titration_2|titration_duration_2|titration_duration_unit_2|titration_target_2|titration_value_2|titration_unit_2|
|1|1|-999|16|1|Active|NA|NA|NA|titration|NA|NA|NA|NA|NA|
|2|1|-999|16|1|Active|NA|NA|NA|titration|NA|NA|NA|NA|NA|No titration|NA|NA|NA|NA|NA|
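
One possible way to get a wide layout like this in PySpark is conditional aggregation: group by the id columns and, for every treatment id and every remaining column, take the first value from the rows with that treatment id. This is a sketch under assumptions, not a drop-in equivalent of R's reshape. The helper names `wide_name` and `reshape_wide_spark` are hypothetical, the column names are assumed to use underscores as in the input above (names containing dots would need backtick quoting in Spark), and the set of treatment ids is assumed to be small enough to collect to the driver.

```python
def wide_name(col, key, sep="_"):
    """Name of a transposed column, e.g. ("titration", 2) gives "titration_2"."""
    return f"{col}{sep}{key}"


def reshape_wide_spark(df, id_cols, time_col, sep="_"):
    """Hypothetical helper: one output row per combination of id_cols."""
    # Imported lazily so wide_name stays usable without Spark installed.
    from pyspark.sql import functions as F

    # Columns to transpose: everything except the ids and the time column.
    value_cols = [c for c in df.columns if c not in id_cols and c != time_col]
    # The distinct time_col values decide how many column sets are created.
    keys = [r[0] for r in df.select(time_col).distinct().orderBy(time_col).collect()]

    aggs = [
        # when() without otherwise() yields null for the other treatment ids,
        # and first(..., ignorenulls=True) skips those nulls.
        F.first(F.when(F.col(time_col) == k, F.col(c)), ignorenulls=True)
        .alias(wide_name(c, k, sep))
        for k in keys
        for c in value_cols
    ]
    return df.groupBy(*id_cols).agg(*aggs)
```

With a DataFrame such as `trt_stacked` (assumed to hold the input shown above) the call would look like `reshape_wide_spark(trt_stacked, ["trial_id", "arm_id", "re_randomized_arm_id", "phase_id"], "treatment_id")`. Two caveats: `first` depends on row order, which Spark does not guarantee after a shuffle, and `ignorenulls=True` also skips genuine nulls in the data, so the result can differ from R when several rows share the same ids and treatment id. `groupBy(...).pivot(time_col).agg(...)` is the built-in alternative, but as far as I know it names the generated columns with the pivot value first (for example `1_titration`) when several aggregates are passed, so the columns would need renaming afterwards.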

What approaches can I try to solve this?

0 answers:

No answers yet