Question

import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

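Expected output

A DataFrame with the count of NaN/null values for each column.

Note: the previous questions I found on Stack Overflow only check for null, not NaN. That is why I created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark DataFrame?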

Answer 1

You can use the method shown here, swapping in isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

Or, to count NaN and null values together:

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
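The expression works because when() without an otherwise() yields null for rows that do not match, and count() skips nulls, so only matching rows are counted. Since c is a plain string here, when() treats it as a non-null literal rather than as the column, which is why null rows can be counted too. A rough plain-Python model of that behaviour follows; the helpers `when`, `count` and `is_nan` are stand-ins written for illustration, not the Spark functions:

```python
import math

def when(condition, value):
    """Mimic when() with no otherwise(): the value if the condition holds, else None."""
    return value if condition else None

def count(values):
    """Mimic count(): the number of non-null (non-None) entries."""
    return sum(1 for v in values if v is not None)

def is_nan(v):
    """Mimic isnan(): False for null, True only for a float NaN."""
    return isinstance(v, float) and math.isnan(v)

# The id2 values from the sample DataFrame in the question.
id2 = [None, 5.0, float("nan"), None, 10.0, float("nan"), float("nan")]

# The second argument is the column name as a literal string, as in the answer.
nan_only = count(when(is_nan(v), "id2") for v in id2)
nan_or_null = count(when(is_nan(v) or v is None, "id2") for v in id2)

print(nan_only, nan_or_null)  # 3 5
```

The two results match the 3 and 5 shown in the Spark output for id2.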


Answer 2

You can create a UDF that checks for both null and NaN and returns a boolean value to filter on.

The code is Scala; hopefully you can convert it to Python.

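import org.apache.spark.sql.functions.udf

// Use the boxed java.lang.Float so that a null input reaches the function;
// with a primitive Float the null check could never be true.
val isNaN = udf((value: java.lang.Float) => value == null || value.isNaN)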

val result = data.filter(isNaN(data("column2"))).count()

Hope this helps!
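Since the answer leaves the Python conversion to the reader, here is one possible sketch of it. The names `is_missing` and `count_missing` are hypothetical, and the PySpark imports are done inside the function so that the predicate itself can be tried without a Spark installation:

```python
import math

def is_missing(value):
    """True for a null (None) or a float NaN, False for anything else."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def count_missing(df, column):
    """Count the rows of df whose `column` is null or NaN, using a Python UDF."""
    # Imported lazily: only needed when a real DataFrame is passed in.
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import BooleanType

    is_missing_udf = udf(is_missing, BooleanType())
    return df.filter(is_missing_udf(col(column))).count()

# The predicate can be checked on plain Python values.
print(is_missing(None), is_missing(float("nan")), is_missing(5.0))  # True True False
```

With the DataFrame from the question, `count_missing(df, "id2")` would be the call corresponding to the Scala `filter(...).count()` above. A Python UDF is usually slower than the built-in isnan and isNull column functions, so this is mainly useful when the check needs custom logic.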

Answer 3

For null values in a PySpark DataFrame:

Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null

# The output is a dict whose keys are column names and whose values are the null counts for those columns

{'#': 0,
 'Name': 0,
 'Type 1': 0,
 'Type 2': 386,
 'Total': 0,
 'HP': 0,
 'Attack': 0,
 'Defense': 0,
 'Sp_Atk': 0,
 'Sp_Def': 0,
 'Speed': 0,
 'Generation': 0,
 'Legendary': 0}

Answer 4

Here is my one-liner. Here 'c' is the name of the column.

import pyspark.sql.functions as F
df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()

Answer 5

To make sure the check does not fail on timestamp and string columns, skip them by type. If you also want to see the columns ordered by their number of NaNs and nulls, sort the resulting counts in descending order.

import pyspark.sql.functions as F
# count both nans and nulls
df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c,c_type) in df.dtypes if c_type not in ('timestamp','string')]).show(vertical=True)

# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |
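The answer mentions ordering columns by their NaN/null count but includes no code for that step. One possible sketch, under the assumption that the single-row result is collected into a dict first; `sort_counts_desc` and `missing_counts` are hypothetical helper names, not part of PySpark:

```python
def sort_counts_desc(counts):
    """Return (column, count) pairs ordered from most to fewest missing values."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

def missing_counts(df):
    """Run the query from this answer and return its single row as a dict."""
    # Imported lazily: only needed when a real DataFrame is passed in.
    import pyspark.sql.functions as F

    exprs = [
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in df.dtypes
        if c_type not in ("timestamp", "string")
    ]
    return df.select(exprs).first().asDict()

# Illustrative counts matching the example output shown above.
example = {"Col_A": 10, "Col_B": 1, "Col_C": 2}
print(sort_counts_desc(example))
# [('Col_A', 10), ('Col_C', 2), ('Col_B', 1)]
```

On a real DataFrame the two helpers would be combined as `sort_counts_desc(missing_counts(df))`.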

Answer 6

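An alternative to the methods already provided is to simply filter on the column, like this:

df = df.where(F.col('columnNameHere').isNull())

The advantage is that you do not have to add another column in order to filter, and it is fast on larger datasets.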

Answer 7

I prefer this solution:

from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)

counter = df.count()

df = df.select([(counter - count(c)).alias(c) for c in df.columns])
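Note what this expression measures: count(c) counts the non-null values of a column, so subtracting it from the row count gives the number of nulls. NaN is a non-null floating-point value, so it is not included in the result. A plain-Python sketch of that arithmetic on the id2 values from the question (no Spark involved; the variable names are illustrative):

```python
import math

# The id2 values from the sample DataFrame in the question.
id2 = [None, 5.0, float("nan"), None, 10.0, float("nan"), float("nan")]

counter = len(id2)                                # total number of rows
non_null = sum(1 for v in id2 if v is not None)   # what count(c) would return
null_count = counter - non_null                   # nulls only; NaN is not null
nan_count = sum(1 for v in id2 if v is not None and math.isnan(v))

print(null_count, nan_count)  # 2 3
```

To report NaN values as well, their count would have to be added separately, for example with isnan as in Answer 1.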

How to efficiently find the count of Null and NaN values for each column in a PySpark DataFrame?
