One-hot encoding in PySpark

Time: 2019-06-18 08:05:19

Tags: pyspark pyspark-sql

I have a dataframe (df) with 10K rows. It looks like this -

id     value
 1       .65
 2       .89
 3       .33
 4       .92
 5       .95
 6        .5

My buckets should be = [0, .60, .76, 1]

My expected result -

id     value      bucket       value_.6_.76     value_.76_1     value_0_.6
 1       .65      [.6-.76]           1               0               0
 2       .89      [.76-1]            0               1               0
 3       .33      [0-.6]             0               0               1
 4       .92      [.76-1]            0               1               0
 5       .95      [.76-1]            0               1               0
 6        .5      [0-.6]             0               0               1

What I have done so far -

In my dataframe, id = string and value = double.

from pyspark.ml.feature import Bucketizer

# Bucket index per row: 0.0, 1.0, or 2.0 (handleInvalid="keep" adds an extra bucket for out-of-range values)
bucketizer = Bucketizer(splits=[0, .60, .76, 1], inputCol="value",
                        outputCol="value_bucket", handleInvalid="keep")
df1 = bucketizer.transform(df)
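
For context, a minimal runnable version of this step (a sketch, assuming an active SparkSession named spark; the sample rows mirror the table above):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question: id is a string, value a double
df = spark.createDataFrame(
    [('1', .65), ('2', .89), ('3', .33), ('4', .92), ('5', .95), ('6', .5)],
    ['id', 'value'])

bucketizer = Bucketizer(splits=[0, .60, .76, 1], inputCol='value',
                        outputCol='value_bucket', handleInvalid='keep')
bucketizer.transform(df).show()
# Bucketizer buckets are [0,.6), [.6,.76), [.76,1], so:
# .65 -> 1.0, .89 -> 2.0, .33 -> 0.0, .92 -> 2.0, .95 -> 2.0, .5 -> 0.0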


value_buckets = [0, .60, .76, 1]
bins = list(zip(value_buckets, value_buckets[1:]))

# One lookup row per bucket: the bucket index (as a double, to match the
# Bucketizer output), a readable label, and the one-hot indicator columns.
# Bucketizer intervals are closed on the left; the last also includes 1.
data = [[float(i), '[{0}-{1})'.format(*bin_endpoints)]
        + [0.0] * i + [1.0] + [0.0] * (len(bins) - i - 1)
        for i, bin_endpoints in enumerate(bins)]
schema = ', '.join('`value_bucket_{}_{}` double'.format(start, end)
                   for start, end in bins)

join_df = spark.createDataFrame(
    data, 'value_bucket double, value_bucket_string string, ' + schema)
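
Built this way, join_df is just a three-row lookup table, one row per bucket; roughly (labels and column names follow the formats above):

join_df.show()
# +------------+-------------------+------------------+---------------------+-------------------+
# |value_bucket|value_bucket_string|value_bucket_0_0.6|value_bucket_0.6_0.76|value_bucket_0.76_1|
# +------------+-------------------+------------------+---------------------+-------------------+
# |         0.0|            [0-0.6)|               1.0|                  0.0|                0.0|
# |         1.0|         [0.6-0.76)|               0.0|                  1.0|                0.0|
# |         2.0|           [0.76-1)|               0.0|                  0.0|                1.0|
# +------------+-------------------+------------------+---------------------+-------------------+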

# Attach the label and indicator columns by joining on the bucket index
df2 = (df1.join(join_df, on='value_bucket', how='left')
          .drop('value_bucket')
          .withColumnRenamed('value_bucket_string', 'value_bucket')
          .orderBy('id'))
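
An alternative sketch that skips the lookup-table join entirely and builds the label and one-hot columns directly with when expressions over the same bins (not from the original post; the generated column names contain dots, so later references to them need backticks):

from pyspark.sql import functions as F

df2_alt = df
label = F.lit(None)
for i, (start, end) in enumerate(bins):
    # match Bucketizer semantics: left-closed, and the last bucket includes 1
    upper = F.col('value') <= end if i == len(bins) - 1 else F.col('value') < end
    in_bin = (F.col('value') >= start) & upper
    label = F.when(in_bin, '[{}-{}]'.format(start, end)).otherwise(label)
    df2_alt = df2_alt.withColumn('value_{}_{}'.format(start, end),
                                 F.when(in_bin, 1).otherwise(0))
df2_alt = df2_alt.withColumn('bucket', label).orderBy('id')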

0 answers:

There are no answers.