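Why do these two ways of transforming this DataFrame produce different output DataFrames? Using select on the DataFrame and mapping over the rdd appear to output the same values, but when I take the average of a column I get different results. What is going on here?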
wrong_parsed_data_df = parsed_points_df.select((parsed_points_df.label - min_year).alias('label'), 'features')
parsed_data_df = parsed_points_df.rdd.map(lambda row: LabeledPoint(row['label'] - min_year, row['features'])).toDF()
# View the first point
print '\n{0}'.format(wrong_parsed_data_df.first())
print '\n{0}'.format(parsed_data_df.first())
print '\n{0}'.format(wrong_parsed_data_df.count())
print '\n{0}'.format(parsed_data_df.count())
print wrong_parsed_data_df.printSchema()
print parsed_points_df.printSchema()
OUTPUTS:
Row(label=79.0, features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192]))
Row(features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192]), label=79.0)
6724
6724
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
None
root
|-- features: vector (nullable = true)
|-- label: double (nullable = true)
Then:
average_train_year = (parsed_train_data_df
.selectExpr('avg(label)').first())[0]
wrong_average_train_year = (wrong_parsed_train_data_df
.selectExpr('avg(label)').first())[0]
print average_train_year
print wrong_average_train_year
Output:
54.0403195838
54.0570419918
Answer 0 (score: 0)
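The most likely explanation is a mistake in your code: you use parsed_data_df in the first cell but parsed_train_data_df in the second cell, so the averages are computed on DataFrames other than the two you compared.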
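To illustrate the point, here is a minimal sketch in plain Python (no Spark needed, so lists of dicts stand in for DataFrames). All names and data in it are hypothetical, and the two "train" slices are only an assumption about how the train DataFrames might differ, since the question does not show how they were built. It shows that the select-style and map-style transformations give the same average on the same input, while averages taken over different subsets of the rows do not have to match.

```python
# Hypothetical stand-in data: plain dicts instead of Spark Rows.
min_year = 1922.0
rows = [{"label": min_year + (i % 90), "features": [i * 0.001]}
        for i in range(1000)]


def shift_select_style(data):
    # Stand-in for: df.select((df.label - min_year).alias('label'), 'features')
    return [{"label": r["label"] - min_year, "features": r["features"]}
            for r in data]


def shift_map_style(data):
    # Stand-in for: df.rdd.map(lambda row: LabeledPoint(...)).toDF()
    # Only the field order differs, mirroring the schemas in the question.
    return [{"features": r["features"], "label": r["label"] - min_year}
            for r in data]


def avg_label(data):
    # Stand-in for: df.selectExpr('avg(label)').first()[0]
    return sum(r["label"] for r in data) / len(data)


a = shift_select_style(rows)
b = shift_map_style(rows)

# Same input rows: both transformations give the same average.
print(avg_label(a) == avg_label(b))

# Assumed scenario: the two "train" sets cover different rows,
# so their averages differ even though the transformations agree.
train_a = a[:800]
train_b = b[200:]
print(avg_label(train_a), avg_label(train_b))
```

If the averages still differ after you compute them on DataFrames derived from exactly the same rows, then the transformation itself would be worth a closer look.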