What I am trying to do:
The code is as follows:
# Step 0: create data frame using list and tuple
df = sqlContext.createDataFrame([
    ("user1", 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
    ("user2", 1, 1, 0, 1, 0, 1, 1, 1, 1, 0),
    ("user3", 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
    ("user4", 0, 1, 0, 1, 1, 1, 1, 1, 0, 0),
    ("user5", 1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
    ("user6", 0, 1, 0, 1, 1, 1, 1, 0, 1, 0)
], ["ID", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"])
df.show()
The DataFrame is:
import random
from pyspark.sql import functions as F
# define features
feature = [x for x in df.columns if x not in ['ID']]
# Step 1: random select a few columns from the DataFrame
random.seed(123)
random_col = random.sample(feature, 2)
print(random_col)
Step 1 works fine. The randomly selected features are "x0" and "x4".
# Shuffle the randomly selected columns to create random-noise features
for i in range(0, 2):
    # Step 2: shuffle the values of the column selected in step 1
    rnd_df = df.select(random_col[i]).orderBy(F.rand(i)).withColumnRenamed(random_col[i], 'rnd_col').rnd_col
    # Step 3: add the column from step 2 back to the DataFrame
    df = df.withColumn('random' + str(i+1), rnd_df)