Question

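I want to apply a function to the columns of a PySpark DataFrame, and I am using a UDF to do it, but I would like the returned data to be an object other than a DataFrame column: a pandas DataFrame, a Python list, etc.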

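I am using a classifier to split each column into classes, but I want the result to be a summary of the classes rather than a modified PySpark DataFrame. I don't know whether this is possible with a UDF.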

My code looks like this:

import numpy as np
import pandas as pd
import pyspark 
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType, DoubleType
sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)

df_pd = pd.DataFrame(
    data={ 'Income':[12.0,45.0,24.0,24.0,54.0],
           'Debt':[23.0,4.0,1.0,6.0,3.0]} )
df = sqlCtx.createDataFrame(df_pd)


# function
def clase(x):
    #n = np.mean(df_pd[name])
    #n = np.mean(df_pd["Ingresos"])
    n = 30
    m = 20
    if x>=n:
        x="good"
    elif x>=m:
        x="regular"
    else:
        x="bad"
    return x

# UDF
clase_udf = udf(lambda z: clase(z), StringType())

(
    df.select('Income',
              'Debt',
              clase_udf('Income').alias('new') )
    .show()
)

This gives the following table:

+------+----+-------+
|Income|Debt|    new|
+------+----+-------+
|  12.0|23.0|    bad|
|  45.0| 4.0|   good|
|  24.0| 1.0|regular|
|  24.0| 6.0|regular|
|  54.0| 3.0|   good|
+------+----+-------+

What I would like to get is something like this:

+-------+------------+
| Clases| Description|
+-------+------------+
|   good|   30<Income|
|regular|20<Income<30|
|    bad|   Income<20|
+-------+------------+

That is, something like a summary of the cases.

Answer 1

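You again need a UDF that returns StringType. I would also move your constants out of the function, so that they are global and a single change applies to several functions at once: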

n = 30
m = 20

def description(x):
    if x >= n:
        x = str(n) + " < Income"
    elif x >= m:
        x = str(m) + " < Income < " + str(n)
    else:
        x = "Income < " + str(m)
    return x

description_udf = udf(lambda z: description(z), StringType())

df.select(
    clase_udf('Income').alias('Clases'),
    description_udf("Income").alias("Description")
).distinct().show()

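The output is (row order may vary, since distinct() does not guarantee one):

+-------+----------------+
| Clases|     Description|
+-------+----------------+
|   good|     30 < Income|
|regular|20 < Income < 30|
|    bad|     Income < 20|
+-------+----------------+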
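The idea in this answer (shared thresholds, one function per output column, then de-duplication) can be sketched in plain Python without Spark. This is only an illustration under assumptions: the names N, M and summarize are hypothetical and not part of the original code, and the de-duplication loop stands in for what .distinct() does on the selected columns.

```python
# Hypothetical sketch: thresholds live at module level so the classifier
# and the description builder always agree on the same cut-offs.
N = 30  # lower bound of "good" (n in the answer)
M = 20  # lower bound of "regular" (m in the answer)


def clase(x, n=N, m=M):
    """Classify one income value, mirroring the question's function."""
    if x >= n:
        return "good"
    elif x >= m:
        return "regular"
    return "bad"


def description(x, n=N, m=M):
    """Describe the class of one income value, mirroring the answer."""
    if x >= n:
        return f"{n} < Income"
    elif x >= m:
        return f"{m} < Income < {n}"
    return f"Income < {m}"


def summarize(values):
    """Return the distinct (class, description) pairs in first-seen order."""
    seen = []
    for v in values:
        pair = (clase(v), description(v))
        if pair not in seen:
            seen.append(pair)
    return seen


incomes = [12.0, 45.0, 24.0, 24.0, 54.0]  # the Income column from the question
summary = summarize(incomes)
for row in summary:
    print(row)
```

The result is an ordinary Python list of tuples. Like the distinct() approach, it only lists classes that actually occur in the data. If a Spark table is wanted afterwards, such a list could be passed to createDataFrame together with a list of column names.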

Can a UDF in PySpark return an object other than a column?
