Question

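I am using pyspark with hiveContext.sql, and I want to filter out all null and empty values from my data.

So I used a simple SQL command to filter out the null values first, but it does not work.

My code: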

hiveContext.sql("select column1 from table where column2 is not null")

But it works without the expression "where column2 is not null".

Error:

Py4JavaError: An error occurred while calling o577.showString

I think this is because my select is wrong.

Data example:

column 1 | column 2
null     |   1
null     |   2
1        |   3
2        |   4
null     |   2
3        |   8

Goal:

column 1 | column 2
1        |   3
2        |   4
3        |   8


Answer 1

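We cannot pass a Hive table name directly to the Hive context sql method, because it does not understand Hive table names. One way to read a Hive table is to use the pyspark shell.

We need to register the DataFrame obtained from reading the Hive table. Then we can run the SQL query.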

Answer 2

You have to provide database_name.table and run the same query. Let me know if that helps.

Answer 3

This worked for me:

df.na.drop(subset=["column1"])

Answer 4

Have you entered the null values manually?
If so, they will be treated as normal strings.
I tried the following two use cases:

dbname.person table in Hive:

name    age

aaa     null   // this null was entered manually (case 1)
Andy    30
Justin  19
okay    NULL   // this NULL appeared because the field was left blank (case 2)

---------------------------------
hiveContext.sql("select * from dbname.person").show()
+------+----+
|  name| age|
+------+----+
|   aaa|null|
|  Andy|  30|
|Justin|  19|
|  okay|null|
+------+----+

---------------------------------
case 2
hiveContext.sql("select * from dbname.person where age is not null").show()
+------+----+
|  name| age|
+------+----+
|   aaa|null|
|  Andy|  30|
|Justin|  19|
+------+----+

---------------------------------
case 1
hiveContext.sql("select * from dbname.person where age != 'null'").show()
+------+----+
|  name| age|
+------+----+
|  Andy|  30|
|Justin|  19|
|  okay|null|
+------+----+
---------------------------------

I hope the use cases above clear up your doubts about filtering out null values. If you are querying a table registered in Spark, use sqlContext.
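
The difference between a real SQL NULL and the literal string 'null' can also be checked outside Spark. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (an assumption made so that it runs anywhere; Hive's type handling may differ), with the person rows from the example above stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [
        ("aaa", "null"),   # the string 'null', typed in manually
        ("Andy", "30"),
        ("Justin", "19"),
        ("okay", None),    # a real NULL (field left blank)
    ],
)

def names(where_clause):
    """Return the names matching the given WHERE clause, in insertion order."""
    query = f"SELECT name FROM person WHERE {where_clause} ORDER BY rowid"
    return [row[0] for row in conn.execute(query)]

# IS NOT NULL removes only the real NULL; the string 'null' survives.
not_null = names("age IS NOT NULL")

# A comparison against a real NULL is never true, so this filter removes
# the string 'null' and, in this stand-in, the real NULL as well.
not_null_string = names("age != 'null'")

# Spelling out both predicates makes the intent explicit.
clean = names("age IS NOT NULL AND age != 'null'")

print(not_null, not_null_string, clean)
conn.close()
```

In this stand-in, age != 'null' evaluates to NULL rather than true for the row with a real NULL, so that row is filtered out too.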

Filtering out nulls and empty strings in hivecontext.sql
