Question

I am trying to run some code, but I get the error:

'DataFrame' object has no attribute '_get_object_id'


Answer 1

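You can't reference a second Spark DataFrame inside a function unless you use a join. IIUC, you can do the following to get your desired result.

Suppose means is the following: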

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition:

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

In this case, though, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean, when

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+
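
To make the logic concrete, here is a minimal plain-Python sketch (no Spark involved) of what both approaches above compute: the mean of the non-null col1 values per id, used to fill the nulls in that group. The input rows are an assumption, chosen only so that the result matches the output shown above; the original df is not given in the thread, and fill_with_group_mean is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical input rows (id, col1); None marks a missing value.
# These values are assumed, picked to be consistent with the output above.
rows = [
    (1, 12.0), (1, None), (1, 14.0), (1, 10.0),
    (2, None), (2, 22.0), (2, 20.0),
    (3, 300.0), (3, None),
]


def fill_with_group_mean(rows):
    """Replace each None in col1 with the mean of the non-null values sharing its id."""
    # Collect the non-null values per id (nulls are ignored when averaging).
    groups = defaultdict(list)
    for key, value in rows:
        if value is not None:
            groups[key].append(value)

    # One mean per id, playing the role of the means DataFrame.
    means = {key: sum(values) / len(values) for key, values in groups.items()}

    # Equivalent of when(col1 is null, mean).otherwise(col1).
    # A group with no non-null values has no mean, so its nulls stay None.
    return [(key, means.get(key) if value is None else value) for key, value in rows]


filled = fill_with_group_mean(rows)
for row in filled:
    print(row)
```

The dictionary lookup corresponds to the join on id in the first approach, while the Window version computes the same per-id mean without materializing a separate means DataFrame.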

Answer 2

I think you are using the Scala API, in which you use (). In PySpark, use [] instead.
