Get the first non-null value in a group (Spark 1.6)

Date: 2016-05-20 03:14:08

Tags: apache-spark pyspark spark-dataframe apache-spark-1.6

How do I get the first non-null value from a group? I tried using first together with coalesce, but I don't get the desired behavior (I seem to get the first row instead).

Setup:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext("local")

sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

I tried:

F.first(F.coalesce("code"))

Desired output:

+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a|code1|name2|
+---+-----+-----+

2 answers:

Answer 0 (score: 13):

For Spark 1.3 - 1.5, this does the trick:

from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()

+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+

Edit

Apparently, in version 1.6 they changed the way the first aggregate function is handled. Now, the underlying class First is supposed to be constructed with a second argument, ignoreNullsExpr, which is not yet used by the first aggregate function (as can be seen here). In Spark 2.0, however, it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
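For reference, once that ignorenulls argument is exposed (Spark 2.0+, as noted above), the whole thing should collapse back into a single aggregation. A minimal sketch, assuming the df defined in the question:

from pyspark.sql import functions as F

# Spark 2.0+ only: first() accepts an ignorenulls flag, so nulls are skipped per group
(df
  .groupBy("id")
  .agg(F.first("code", True).alias("code"),
       F.first("name", True).alias("name"))
  .show())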

Therefore, for Spark 1.6 the approach has to be different and, unfortunately, a bit less efficient. One idea is the following:

from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()

+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+
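A minor variation (not in the original answer): adding .alias(...) to the aggregations keeps the joined result from coming back with column names like first(code)():

from pyspark.sql import functions as F

# Same filter + groupBy + join idea, but with explicit aliases for clean column names
df1 = (df.select('id', 'code')
         .filter(df['code'].isNotNull())
         .groupBy(df['id'])
         .agg(F.first(df['code']).alias('code')))
df2 = (df.select('id', 'name')
         .filter(df['name'].isNotNull())
         .groupBy(df['id'])
         .agg(F.first(df['name']).alias('name')))
df1.join(df2, 'id').show()  # columns: id, code, name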

Maybe there is a better option. I'll edit this answer if I find one.

Answer 1 (score: 1):

Since there was only one non-null value per group, using min / max in 1.6 worked for my purposes:

(df
  .groupby("id")
  .agg(F.min("code"),
       F.min("name"))
  .show())

+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+