How do I change values in a PySpark DataFrame based on a condition on the same column?

Time: 2019-04-21 16:58:07

Tags: pyspark apache-spark-sql pyspark-sql

Consider a sample DataFrame:

df = 
+-------+-----+
|   tech|state|
+-------+-----+
|     70|wa   |
|     50|mn   |
|     20|fl   |
|     50|mo   |
|     10|ar   |
|     90|wi   |
|     30|al   |
|     50|ca   |
+-------+-----+

I want to change the "tech" column so that every value of 50 becomes 1 and every other value becomes 0.

The output should look like this:

df = 
+-------+-----+
|   tech|state|
+-------+-----+
|      0|wa   |
|      1|mn   |
|      0|fl   |
|      1|mo   |
|      0|ar   |
|      0|wi   |
|      0|al   |
|      1|ca   |
+-------+-----+

This is what I have so far:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType


changing_column = 'tech'
udf_first = UserDefinedFunction(lambda x: 1, IntegerType())
udf_second = UserDefinedFunction(lambda x: 0, IntegerType())
first_df = zero_df.select(*[udf_first(changing_column) if column == 50 else column for column in zero_df])
second_df = first_df.select(*[udf_second(changing_column) if column != 50 else column for column in first_df])
second_df.show()

1 answer:

Answer 0: (score: 1)

Hope this helps

' Connect to active directory
Set objDSE = GetObject("LDAP://rootDSE")
Set objConnection = CreateObject("ADODB.Connection")
objConnection.Provider = "ADsDSOObject"
objConnection.Open
Set objCommand = CreateObject("ADODB.Command")
Set objCommand.ActiveConnection = objConnection
SearchString = "Max Mustermann"

' Contact lookup using SQL-query
objCommand.CommandText = _
    "SELECT givenname, sn, mail, telephoneNumber, mobile, mailNickName, c, l, postalCode, department, company, streetAddress " & _
    "FROM 'LDAP://" & objDSE.Get("defaultNamingContext") & "' " & _
    "WHERE objectCategory='person' AND (mail = '" & SearchString & "' OR givenname & sn = '" & SearchString & "')"
Set objRecordset = objCommand.Execute

If Not objRecordset.EOF Then
' Further processing which is not relevant to the question
' ...