Shuffle selected columns using PySpark

Date: 2019-07-09 03:34:08

Tags: pyspark

I am trying to:

  1. Randomly select a few columns from a DataFrame
  2. Shuffle the values in the columns selected in step 1
  3. Add the columns from step 2 back to the DataFrame

The code is as follows:

# Step 0: create data frame using list and tuple
df = sqlContext.createDataFrame([
        ("user1", 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
        ("user2", 1, 1, 0, 1, 0, 1, 1, 1, 1, 0),
        ("user3", 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
        ("user4", 0, 1, 0, 1, 1, 1, 1, 1, 0, 0),
        ("user5", 1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
        ("user6", 0, 1, 0, 1, 1, 1, 1, 0, 1, 0)
    ], ["ID", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"])

df.show()

The DataFrame is:

+-----+--+--+--+--+--+--+--+--+--+--+
|   ID|x0|x1|x2|x3|x4|x5|x6|x7|x8|x9|
+-----+--+--+--+--+--+--+--+--+--+--+
|user1| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0|
|user2| 1| 1| 0| 1| 0| 1| 1| 1| 1| 0|
|user3| 1| 1| 1| 1| 0| 0| 0| 1| 1| 0|
|user4| 0| 1| 0| 1| 1| 1| 1| 1| 0| 0|
|user5| 1| 1| 1| 1| 0| 1| 0| 1| 1| 0|
|user6| 0| 1| 0| 1| 1| 1| 1| 0| 1| 0|
+-----+--+--+--+--+--+--+--+--+--+--+

import random
from pyspark.sql import functions as F

# define features
feature = [x for x in df.columns if x not in ['ID']]

# Step 1: random select a few columns from the DataFrame
random.seed(123)
random_col = random.sample(feature, 2)
print(random_col)

Step 1 works fine. The randomly selected features are "x0" and "x4".
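As a side note, the selection in step 1 is reproducible only because the generator is reseeded immediately before sampling. A minimal illustration (standalone, without Spark):

```python
import random

# Same feature list as in the question: x0 .. x9
feature = ["x" + str(i) for i in range(10)]

# Reseeding right before random.sample makes the draw deterministic
random.seed(123)
first = random.sample(feature, 2)

random.seed(123)
second = random.sample(feature, 2)

# Same seed, same selection
assert first == second
print(first)
```

Without the `random.seed(123)` call, each run would pick a different pair of columns.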

# shuffle the random selected columns to create random noise feature
for i in range(0, 2):
    # Step 2: shuffle the value from the column select from step 1
    rnd_df = df.select(random_col[i]).orderBy(F.rand(i)).withColumnRenamed(random_col[i], 'rnd_col').rnd_col
    # step 3: add these columns from step 2 back to the DataFrame
    df = df.withColumn('random'+ str(i+1), rnd_df)

Step 2 works fine. However, step 3 fails with the error shown in the screenshot. Does anyone know how to fix this?

0 Answers:

There are no answers yet.