I'm using pyspark 1.6.1, and I created a dataframe like this:
toy_df = sqlContext.createDataFrame([('blah',10)], ['name', 'age'])
Now, watch what happens when I try to query this dataframe for 'blah', first with where and then again with select:
toy_df_where = toy_df.where(toy_df['name'] != 'blah')
toy_df_where.count()
0
toy_df_select = toy_df.select(toy_df['name'] != 'blah')
toy_df_select.count()
1
Why do these two options give different results?
Thanks.
Answer 0 (score: 2)
where and filter are used to filter rows, while select is used to select columns. So in your select statement, toy_df['name'] != 'blah' constructs a new boolean column, and select simply picks that column into the resulting dataframe. To see this more clearly, consider this example:
>>> toy_df = sqlContext.createDataFrame([('blah',10), ('foo', 20)], ['name', 'age'])
>>> toy_df_where = toy_df.where(toy_df['name'] != 'blah')
>>> toy_df_where.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+
# filter works the same way as where
>>> toy_df_filter = toy_df.filter(toy_df['name'] != 'blah')
>>> toy_df_filter.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+
>>> toy_df_select = toy_df.select((toy_df['name'] != 'blah').alias('cond'))
# give the column a new name with alias
>>> toy_df_select.show()
+-----+
| cond|
+-----+
|false|
| true|
+-----+
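This also explains the counts in the question: count() counts rows, so the select result keeps one row per input row regardless of the boolean values, while where/filter drops the non-matching rows before counting. As a minimal sketch (continuing the same session as above), you could count only the rows where the condition holds by filtering on the boolean column:
>>> toy_df_select.count()
2
>>> toy_df_where.count()
1
# keep only the rows where cond is True, then count
>>> toy_df_select.filter(toy_df_select['cond']).count()
1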