I have a DataFrame (df) with 10K rows. My DataFrame looks like -
id value
1 .65
2 .89
3 .33
4 .92
5 .95
6 .5
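
For reference, a minimal sketch that reproduces this frame (it assumes a local SparkSession named spark, which the later snippets also use):

from pyspark.sql import SparkSession

# Toy reproduction of the example data; id is string, value is double
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('1', .65), ('2', .89), ('3', .33), ('4', .92), ('5', .95), ('6', .5)],
    ['id', 'value'])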
My buckets should be = [0, .60, .76, 1]
My expected output -
id  value  bucket    value_.6_.76  value_.76_1  value_0_.6
1   .65    [.6-.76]  1             0            0
2   .89    [.76-1]   0             1            0
3   .33    [0-.6]    0             0            1
4   .92    [.76-1]   0             1            0
5   .95    [.76-1]   0             1            0
6   .5     [0-.6]    0             0            1
What I have done so far -
In my DataFrame, id = string and value = double.
from pyspark.ml.feature import Bucketizer

# handleInvalid="keep" routes NaN/out-of-range values to an extra bucket
bucketizer = Bucketizer(splits=[0, .60, .76, 1], inputCol="value",
                        outputCol="value_bucket", handleInvalid="keep")
df1 = bucketizer.transform(df)
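
At this point value_bucket holds the bucket index as a double: .33 and .5 fall in bucket 0.0, .65 in 1.0, and .89/.92/.95 in 2.0. A quick check:

df1.select('id', 'value', 'value_bucket').show()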
value_buckets = [0, .60, .76, 1]
bins = list(zip(value_buckets, value_buckets[1:]))

# One lookup row per bucket: the index, a label, and a one-hot 0/1 vector
data = [[float(i), '[{0}-{1}]'.format(*bin_endpoints)] + [0] * i + [1] + [0] * (len(bins) - i - 1)
        for i, bin_endpoints in enumerate(bins)]
# A plain name list sidesteps DDL parsing issues with '.' in the column names
columns = (['value_bucket', 'value_bucket_string']
           + ['value_{}_{}'.format(start, end) for start, end in bins])
join_df = spark.createDataFrame(data, columns)
df2 = (df1.join(join_df, on='value_bucket', how='left')
          .drop('value_bucket')
          .withColumnRenamed('value_bucket_string', 'value_bucket')
          .orderBy('id'))
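
If the join feels heavy, a sketch of an alternative (my own suggestion, reusing df1 and bins from above) derives the label and the 0/1 indicator columns directly from the bucket index with when/otherwise, so no lookup table is needed:

from pyspark.sql import functions as F

labels = ['[{0}-{1}]'.format(start, end) for start, end in bins]

# Fold the labels into one nested when/otherwise keyed on the bucket index
label_expr = F.lit(None).cast('string')
for i, lab in enumerate(labels):
    label_expr = F.when(F.col('value_bucket') == i, lab).otherwise(label_expr)

df3 = df1.withColumn('bucket', label_expr)
for i, (start, end) in enumerate(bins):
    # The comparison yields a boolean; cast to int for the 0/1 dummy column
    df3 = df3.withColumn('value_{}_{}'.format(start, end),
                         (F.col('value_bucket') == i).cast('int'))
df3 = df3.drop('value_bucket').orderBy('id')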