Question

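I am trying to:

1. Randomly select a few columns from a DataFrame
2. Randomly shuffle the values in the columns selected in step 1
3. Add the shuffled columns from step 2 back to the DataFrame

The code is as follows: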

# Step 0: create data frame using list and tuple
df = sqlContext.createDataFrame([
        ("user1", 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
        ("user2", 1, 1, 0, 1, 0, 1, 1, 1, 1, 0),
        ("user3", 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
        ("user4", 0, 1, 0, 1, 1, 1, 1, 1, 0, 0),
        ("user5", 1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
        ("user6", 0, 1, 0, 1, 1, 1, 1, 0, 1, 0)
    ], ["ID", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"])

df.show()

That displays the DataFrame. Next:

import random
from pyspark.sql import functions as F

# define features
feature = [x for x in df.columns if x not in ['ID']]

# Step 1: random select a few columns from the DataFrame
random.seed(123)
random_col = random.sample(feature, 2)
print(random_col)

Step 1 works fine. The randomly selected features are "x0" and "x4".

# shuffle the random selected columns to create random noise feature
for i in range(0, 2):
    # Step 2: shuffle the value from the column select from step 1
    rnd_df = df.select(random_col[i]).orderBy(F.rand(i)).withColumnRenamed(random_col[i], 'rnd_col').rnd_col
    # step 3: add these columns from step 2 back to the DataFrame
    df = df.withColumn('random'+ str(i+1), rnd_df)

Step 2 works fine. However, step 3 fails with the following error. Does anyone know how to solve this?

Shuffle selected columns using pyspark

0 answers: