How do select and map differ in their output values in PySpark?

Time: 2016-11-11 18:16:32

Tags: python apache-spark pyspark pyspark-sql

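Why do these two ways of transforming this DataFrame produce different output DataFrames? Using select on the DataFrame and using map on its RDD appear to output the same values, but when I take the average of a column I get different results. What is going on here?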

wrong_parsed_data_df = parsed_points_df.select((parsed_points_df.label - min_year).alias('label'), 'features')
parsed_data_df = parsed_points_df.rdd.map(lambda row: LabeledPoint(row['label'] - min_year, row['features'])).toDF()

# View the first point
print '\n{0}'.format(wrong_parsed_data_df.first())
print '\n{0}'.format(parsed_data_df.first())

print '\n{0}'.format(wrong_parsed_data_df.count())
print '\n{0}'.format(parsed_data_df.count())

print wrong_parsed_data_df.printSchema()
print parsed_points_df.printSchema()

Output:

Row(label=79.0, features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192]))

Row(features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192]), label=79.0)

6724

6724

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

None
root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

Then:

average_train_year = (parsed_train_data_df
                    .selectExpr('avg(label)').first())[0]

wrong_average_train_year = (wrong_parsed_train_data_df
                        .selectExpr('avg(label)').first())[0]

print average_train_year
print wrong_average_train_year

Output:

54.0403195838
54.0570419918

1 answer:

Answer 0 (score: 0)

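The most likely explanation is a mistake in your code: you use parsed_data_df in the first cell, but parsed_train_data_df in the second. The averages are therefore computed on different DataFrames from the ones you inspected in the first cell.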
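
To illustrate the point without a Spark cluster, here is a minimal plain-Python sketch. It is only an analogy, not PySpark code. The rows, the value of min_year, and the helper names (shift_label, mean_label) are made up for illustration and are not taken from the question's dataset. It shows that a column-style and a row-style transformation of the same rows give the same average, while averaging two different subsets does not.

```python
# Hypothetical stand-in rows; the real data in the question is not available.
min_year = 1922
rows = [{"label": float(min_year + i), "features": [i * 0.1]} for i in range(10)]

# Route 1: analogous to df.select((df.label - min_year).alias('label'), 'features')
via_select = [{"label": r["label"] - min_year, "features": r["features"]}
              for r in rows]


# Route 2: analogous to df.rdd.map(lambda row: LabeledPoint(...)).toDF()
def shift_label(row):
    # Field order differs from route 1, as in the question's output,
    # but that has no effect on the values.
    return {"features": row["features"], "label": row["label"] - min_year}


via_map = list(map(shift_label, rows))


def mean_label(data):
    """Average of the 'label' field over a list of row dicts."""
    return sum(r["label"] for r in data) / len(data)


# Same underlying rows: both routes give the same average.
full_select_mean = mean_label(via_select)
full_map_mean = mean_label(via_map)

# Different subsets (standing in for two separately built "train" sets):
# the averages no longer have to agree.
train_a = via_select[:8]
train_b = via_map[2:]
train_a_mean = mean_label(train_a)
train_b_mean = mean_label(train_b)

print(full_select_mean == full_map_mean)
print(train_a_mean == train_b_mean)
```

If the same holds in your notebook, comparing the averages of parsed_data_df and wrong_parsed_data_df directly (rather than the two *_train_* DataFrames) should give matching values.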