How can I get the first non-null value from a group? I tried using first together with coalesce,
but I don't get the desired behavior (I seem to get the first row).
I tried:
F.first(F.coalesce("code"))
The desired output is the first non-null value of each column per group (code1 and name2 for id a). The sample data:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
Answer 0 (score: 13)
For Spark 1.3 - 1.5, this does the trick:
from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()
+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
| a| code1| name2|
+---+-----------+-----------+
EDIT:
Apparently, in version 1.6 they changed the way the first aggregate function is handled. Now, the underlying class First should be constructed with a second argument, ignoreNullsExpr, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
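For reference, a short sketch of what that could look like on Spark 2.0 (assuming the second argument shown above maps to the ignorenulls keyword; the alias names are only for readability):
from pyspark.sql import functions as F
# Spark 2.0+: first() can be told to skip nulls directly
(df
    .groupBy("id")
    .agg(F.first("code", ignorenulls=True).alias("first_code"),
         F.first("name", ignorenulls=True).alias("first_name"))
    .show())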
Therefore, for Spark 1.6 the approach has to be different and, unfortunately, a bit less efficient. One idea is the following:
from pyspark.sql import functions as F
# Filter out the null rows per column before aggregating,
# then join the per-column results back together on id
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()
+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
| a| code1| name2|
+---+-------------+-------------+
Maybe there is a better option. I will edit this answer if I find one.
Answer 1 (score: 1)
Because there was only one non-null value per group, using min/max in 1.6 worked for my purposes:
(df
    .groupby("id")
    .agg(F.min("code"),
         F.min("name"))
    .show())
+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
| a| code1| name2|
+---+---------+---------+