Question

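I have a badly behaved UDF (note: not a filter UDF!) that fails essentially at random. When I run this UDF through SQL on a temp table registered from a DataFrame, it appears to execute correctly. I did not even notice the problem as long as my test used some random input and only checked whether I got any result at all. Once I wrote a meaningful unit test, I found that the output had fewer rows than expected. The cause was a bug in my UDF, but the strange part is that the error was silent. Or at least I see no error in the logs and cannot catch any exception: the query simply runs, and when the UDF fails on a row, the result does not contain the transformed row. (The problem is not related to the data.)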

Is there a way to detect problems like this? Or is this the expected behavior of Spark?

(Spark 1.6.1)

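The UDF problem: basically, the UDF receives a HashMap, and instead of copying the key set I accidentally obtained it via keySet(), which "returns a Set view of the keys contained in this map." I then removed elements from that set, so I corrupted the HashMap, and that caused the problem in the UDF. It seems that when an error occurs while a given field of a row is being processed, the whole row disappears from the result DataFrame. This is meant more as a hint.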
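To make the keySet() behavior described above concrete, here is a minimal sketch in plain Java, independent of Spark. The class name KeySetViewDemo, its helper methods, and the sample data are hypothetical and exist only for illustration. It shows that removing a key through the view also removes the entry from the map, while removing it from a copy leaves the map untouched.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeySetViewDemo {

    // Builds a small map with two keys (hypothetical sample data).
    public static Map<Long, List<Long>> sampleMap() {
        Map<Long, List<Long>> map = new HashMap<>();
        map.put(1L, Arrays.asList(10L, 11L));
        map.put(2L, Arrays.asList(20L));
        return map;
    }

    // Removes a key through the live keySet() view: the map itself shrinks.
    public static Set<Long> removeViaView(Map<Long, List<Long>> map, Long key) {
        Set<Long> keys = map.keySet();
        keys.remove(key);
        return keys;
    }

    // Removes a key from a defensive copy: the map is left untouched.
    public static Set<Long> removeViaCopy(Map<Long, List<Long>> map, Long key) {
        Set<Long> keys = new HashSet<>(map.keySet());
        keys.remove(key);
        return keys;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> viewed = sampleMap();
        removeViaView(viewed, 1L);
        System.out.println("after view removal: " + viewed.size()); // prints 1

        Map<Long, List<Long>> copied = sampleMap();
        removeViaCopy(copied, 1L);
        System.out.println("after copy removal: " + copied.size()); // prints 2
    }
}
```

In the UDF, a map captured by the closure and mutated this way would presumably look different on each later call, which would fit the seemingly random failures.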

Update: here is the (anonymized) code in which I experienced the behavior described above.

The UDF that causes the problem:

public class AnonymUdf {
public static Tuple2<Set<Long>, Set<Long>> processing(WrappedArray<GenericRowWithSchema> inputData, HashMap<Long, List<Long>> someInputMap, Long actionTime) {
    Set<Long> someSet = new HashSet<>();
    Set<Long> anotherSet = new HashSet<>();
    // This is the problematic line: keySet() returns a view backed by the map, and the input map should not change
    Set<Long> theInputKeySet = someInputMap.keySet();
    // This is the fix: copy the key set
    // Set<Long> theInputKeySet = someInputMap.keySet().stream().collect(Collectors.toSet());
    if (someInputMap != null && someInputMap.size() > 0) {
        Iterator<GenericRowWithSchema> it = inputData.iterator();
        int idIdx = -1;
        int idTypeIdx = -1;
        int timeIdx = -1;
        while (it.hasNext()) {
            GenericRowWithSchema r = it.next();
            if (idIdx == -1) {
                idIdx = r.fieldIndex("beacon_id");
            }
            if (idTypeIdx == -1) {
                idTypeIdx = r.fieldIndex("beacon_type_id");
            }
            if (timeIdx == -1) {
                timeIdx = r.fieldIndex("action_time");
            }
            Object o = r.get(idTypeIdx);
            Integer beaconIdType = o instanceof Long ? ((Long) o).intValue() : (Integer) o;
            if (someInputMap.containsKey(r.getLong(idIdx))) {
                // if given beacon id is part of the beacon id universe (someInputMap)
                if (2 == beaconIdType && actionTime <= r.getLong(timeIdx)) {
                    // if the beacon event is conversion (type 2) and happened recently
                    someSet.add(r.getLong(idIdx));
                } else {
                    anotherSet.add(r.getLong(idIdx));
                }
            }
        }
    }
    // And this corrupts someInputMap: removing from the keySet view removes the entries from the map
    theInputKeySet.removeAll(someSet);
    theInputKeySet.removeAll(anotherSet);

    return new Tuple2<>(someSet, theInputKeySet);
}
}

Calling the UDF:

...
sqlContext.udf().register("myUdf", (WrappedArray<GenericRowWithSchema> ybe) ->
        AnonymUdf.processing(input, map, time),
        DataTypes.createStructType(
        Arrays.asList(
        DataTypes.createStructField("result", DataTypes.createArrayType(DataTypes.LongType), false),
        DataTypes.createStructField("result2", DataTypes.createArrayType(DataTypes.LongType), false)
        ))
        );

dataFrame1.registerTempTable("tempTable");
DataFrame dataFrame2 = sqlContext.sql("SELECT id, ..." +
        "myUdf(field) AS result FROM tempTable");
return dataFrame2;

Java Spark - Can a UDF have silent errors?

0 answers: