Filter a pyspark dataframe to keep rows containing at least one null value (keep, not drop)

Date: 2016-12-25 12:04:06

Tags: apache-spark pyspark pyspark-sql

Suppose I have the following pyspark dataframe:

>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
>>> df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

How can I now select or filter the rows that contain at least one null value, as shown below?

>>> df.SOME-COMMAND-HERE.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

2 answers:

Answer 0 (score: 2)

Create an intermediate data frame from the original by dropping the desired rows, then "subtract" it from the original:

# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

# Construct an intermediate dataframe without the desired rows
df_drop = df.dropna('any')
df_drop.show()
+---+---------+---+
| c1|       c2| c3|
+---+---------+---+
|  A|Amsterdam|3.4|
+---+---------+---+

# Then subtract it from the original to reveal the desired rows
df.subtract(df_drop).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+
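
For comparison, the same rows can be selected in a single pass with an explicit filter, avoiding the extra shuffle that subtract incurs. A minimal sketch, assuming the null check should cover every column in df.columns:

from functools import reduce
from pyspark.sql.functions import col

# OR together one "IS NULL" predicate per column
at_least_one_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
df.filter(at_least_one_null).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+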

Answer 1 (score: 0)

Construct an appropriate raw SQL query and apply it:

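A minimal sketch of this approach, assuming the data frame is registered as a temp view (the view name df_view is an arbitrary choice) and the null check is spelled out per column:

# Register the data frame so it is visible to Spark SQL
df.createOrReplaceTempView('df_view')

# Keep every row where at least one column is NULL
spark.sql('SELECT * FROM df_view WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL').show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+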