I am trying to bucketize the columns that contain the word "road" in a 5k-row dataset and create a new DataFrame.
I am not sure how to do this; here is what I have tried so far:
from pyspark.ml.feature import Bucketizer

spike_cols = [col for col in df.columns if "road" in col]

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    bucketedData = bucketizer.transform(df)
Answer 0 (score: 4)
Modify df in the loop (otherwise each iteration transforms the original df, and only the last column's bucket survives in bucketedData):
from pyspark.ml.feature import Bucketizer

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    df = bucketizer.transform(df)
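To sanity-check the result, the new columns can be listed by their suffix; a quick sketch, assuming the loop above has run:

bucket_cols = [c for c in df.columns if c.endswith("bucket")]
df.select(bucket_cols).show(5)  # one new bucket column per "road" column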
Or use a Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer

model = Pipeline(stages=[
    Bucketizer(
        splits=[-float("inf"), 10, 100, float("inf")],
        inputCol=x, outputCol=x + "bucket") for x in spike_cols
]).fit(df)

model.transform(df)
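A side benefit of the Pipeline route is that the fitted model can be persisted and reused later; a minimal sketch, where the "/tmp/bucket_model" path is illustrative:

from pyspark.ml import PipelineModel

model.write().overwrite().save("/tmp/bucket_model")  # persist all fitted Bucketizer stages
reloaded = PipelineModel.load("/tmp/bucket_model")   # restore later without refitting
reloaded.transform(df)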
Answer 1 (score: 0)
Since Spark 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter, so this becomes much easier:
from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), 10, 100, float("inf")]
params = [(col, col + "bucket", splits) for col in df.columns if "road" in col]
input_cols, output_cols, splits_array = zip(*params)

bucketizer = Bucketizer(inputCols=input_cols, outputCols=output_cols,
                        splitsArray=splits_array)
bucketedData = bucketizer.transform(df)
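For context, here is a self-contained end-to-end sketch of the 3.0.0 approach on a toy DataFrame (the column names road_a and road_b are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

# Toy data: two numeric "road" columns (names are illustrative).
df = spark.createDataFrame([(5.0, 250.0), (50.0, 3.0)], ["road_a", "road_b"])

splits = [-float("inf"), 10, 100, float("inf")]
params = [(c, c + "bucket", splits) for c in df.columns if "road" in c]
input_cols, output_cols, splits_array = zip(*params)

bucketizer = Bucketizer(inputCols=list(input_cols), outputCols=list(output_cols),
                        splitsArray=list(splits_array))
bucketizer.transform(df).show()
# road_a=5.0 falls in bucket 0 (< 10), road_b=250.0 in bucket 2 (>= 100), etc.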