Why do `where` and `select` behave differently on a pyspark DataFrame in version 1.6.1?

Posted: 2017-07-18 19:32:08

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

I am using pyspark 1.6.1, and I created a DataFrame like this:

toy_df = sqlContext.createDataFrame([('blah',10)], ['name', 'age'])

Now watch what happens when I query this DataFrame for 'blah', first with `where` and then with `select`:

>>> toy_df_where = toy_df.where(toy_df['name'] != 'blah')
>>> toy_df_where.count()
0
>>> toy_df_select = toy_df.select(toy_df['name'] != 'blah')
>>> toy_df_select.count()
1

Why do these two calls give different results?

Thanks.

1 Answer:

Answer 0 (score: 2)

`where` (and its alias `filter`) filters rows, while `select` projects columns. So in your `select` statement, `toy_df['name'] != 'blah'` constructs a new boolean column, and `select` simply includes that column in the resulting DataFrame without removing any rows. This example makes the difference clearer:

>>> toy_df = sqlContext.createDataFrame([('blah',10), ('foo', 20)], ['name', 'age'])

>>> toy_df_where = toy_df.where(toy_df['name'] != 'blah')
>>> toy_df_where.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+

# filter works the same way as where
>>> toy_df_filter = toy_df.filter(toy_df['name'] != 'blah')
>>> toy_df_filter.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+

# give the column a new name with alias
>>> toy_df_select = toy_df.select((toy_df['name'] != 'blah').alias('cond'))
>>> toy_df_select.show()
+-----+
| cond|
+-----+
|false|
| true|
+-----+
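This is also why `count()` returned 1 rather than 0 in the original one-row example: `select` evaluates the expression for every input row and emits exactly one output row per input row, whereas `where` keeps only the rows where the condition is true. A plain-Python analogy (no Spark required; the list of dicts stands in for the DataFrame) sketches the two behaviors:

```python
rows = [{'name': 'blah', 'age': 10}]

# where/filter: keep only the rows where the condition holds,
# so the row count can shrink
where_like = [row for row in rows if row['name'] != 'blah']

# select: evaluate the expression for every row,
# so there is one output row per input row
select_like = [{'cond': row['name'] != 'blah'} for row in rows]

print(len(where_like))   # 0 -> the only row was filtered out
print(len(select_like))  # 1 -> one row, holding the boolean False
```

The same intuition carries over to Spark: a condition passed to `where` decides row membership, while the same condition passed to `select` just becomes a boolean column.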