Get the first non-null value in a group (Spark 1.6)

Date: 2016-05-20 03:14:08

Tags: apache-spark pyspark spark-dataframe apache-spark-1.6

How do I get the first non-null value from a group? I tried using first together with coalesce, but I don't get the desired behavior (I seem to get the first row instead).

Setup:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext("local")

sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

I tried:

F.first(F.coalesce("code"))

Desired output:

+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a|code1|name2|
+---+-----+-----+

2 answers:

Answer 0 (score: 13):

For Spark 1.3 - 1.5, this does the trick:

from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()

+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+

Edit

Apparently, in version 1.6 they changed the way the first aggregate function is handled. Now, the underlying class First is supposed to be constructed with a second argument, ignoreNullsExpr, which is not yet used by the first aggregate function (as can be seen here). In Spark 2.0, however, it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
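For reference, once that ignorenulls argument is exposed (Spark 2.0+, as noted above), the whole thing should collapse back into a single aggregation. A minimal sketch, assuming the df defined in the question:

from pyspark.sql import functions as F

# Spark 2.0+ only: first() accepts an ignorenulls flag, so nulls are skipped per group
(df
  .groupBy("id")
  .agg(F.first("code", True).alias("code"),
       F.first("name", True).alias("name"))
  .show())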

Therefore, for Spark 1.6 the approach has to be different and, unfortunately, a bit less efficient. One idea is the following:

from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()

+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+
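A minor variation (not in the original answer): adding .alias(...) to the aggregations keeps the joined result from coming back with column names like first(code)():

from pyspark.sql import functions as F

# Same filter + groupBy + join idea, but with explicit aliases for clean column names
df1 = (df.select('id', 'code')
         .filter(df['code'].isNotNull())
         .groupBy(df['id'])
         .agg(F.first(df['code']).alias('code')))
df2 = (df.select('id', 'name')
         .filter(df['name'].isNotNull())
         .groupBy(df['id'])
         .agg(F.first(df['name']).alias('name')))
df1.join(df2, 'id').show()  # columns: id, code, name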

Maybe there is a better option. I'll edit this answer if I find one.

Answer 1 (score: 1):

Since there was only one non-null value per group, using min / max in 1.6 worked for my purposes:

(df
  .groupby("id")
  .agg(F.min("code"),
       F.min("name"))
  .show())

+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+