How to pass a dataset column value to a function when using a Spark filter with Scala?

Asked: 2018-04-16 21:12:38

Tags: scala apache-spark bloom-filter

I have an array of actions consisting of a user ID and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+

I want to filter the actions so that only those belonging to users with at least one SEARCH action are kept.

So I created a Bloom filter of the user IDs that have a SEARCH action.

Then I tried to filter all actions based on whether the user is in the Bloom filter:

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
// does not compile: mightContainString expects a String, but $"user_id" is a Column
df.filter(bloomFilter.mightContainString($"user_id"))

But this code fails with an error:

type mismatch;
found   : org.apache.spark.sql.ColumnName
required: String

How can I pass the column value to the BloomFilter.mightContainString method?

2 Answers:

Answer 0 (score: 0)

Create the filter:

val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
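
Here expectedNumItems is the number of distinct values you expect in the column and fpp is the target false-positive probability. For instance, with purely illustrative values (the numbers below are placeholders, not taken from the question):

// illustrative sizing only: ~1 million distinct user IDs, 3% false-positive rate
val f = df.stat.bloomFilter("user_id", 1000000L, 0.03)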

Filter with a udf:

import org.apache.spark.sql.functions.udf

// wrap the driver-side filter in a udf so it can be applied to each row's user_id
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))

If your current Bloom filter implementation is serializable, you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting.
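
As a minimal sketch of that "same way", assuming the filter in the question is Spark's org.apache.spark.util.sketch.BloomFilter (which is serializable), the already collected searchers can be loaded into the filter on the driver and the filter then used from a udf, just like f above:

import org.apache.spark.sql.functions.udf
import org.apache.spark.util.sketch.BloomFilter

// build the filter on the driver from the collected user IDs
val bloomFilter = BloomFilter.create(searchers.length.toLong)
searchers.foreach(bloomFilter.putString(_))

// the serializable filter is captured by the udf closure and shipped to the executors
val mightContainString = udf((s: String) => bloomFilter.mightContainString(s))
df.filter(mightContainString($"user_id"))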

Answer 1 (score: -1)

You can do it like this. I should mention the fact that this is not a good idea, but you can do something along the following lines:


// assuming Spark's built-in org.apache.spark.util.sketch.BloomFilter, which matches the API used above
import org.apache.spark.util.sketch.BloomFilter

val sparkSession = ???
val sc = sparkSession.sparkContext

val bloomFilter = BloomFilter.create(100)

val df = ???

// collect the distinct user IDs that have at least one SEARCH action to the driver
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect

If the bloomFilter instance is serializable, you can remove the broadcast.
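
The broadcast referred to here is not shown in the snippet above; a rough sketch of how the remaining steps might look (the bcBloomFilter and isSearcher names are illustrative, not from the original answer):

import org.apache.spark.sql.functions.udf

// fill the filter on the driver, then broadcast it explicitly to the executors
searchers.foreach(bloomFilter.putString(_))
val bcBloomFilter = sc.broadcast(bloomFilter)

// look up each user_id in the broadcast copy of the filter
val isSearcher = udf((s: String) => bcBloomFilter.value.mightContainString(s))
df.filter(isSearcher($"user_id"))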

Hope this helps, cheers.