I have an array of actions made up of a user ID and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+
I want to keep only the actions that belong to users who have at least one SEARCH action. So I built a Bloom filter from the user IDs that have a SEARCH action, and then tried to filter all actions by checking each user_id against that Bloom filter:
val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))
But this code throws an exception:
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
How can I pass the column values to the BloomFilter.mightContainString method?
Answer 0 (score: 0)
Create the filter:
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
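For the goal in the question (users with at least one SEARCH action), the filter would presumably be built from just the SEARCH rows. With illustrative sizing values (the numbers below are my own, not from the answer), that could look like:
// Build the filter only from users who performed a SEARCH,
// expecting roughly 1000 distinct searchers with a 1% false-positive rate.
val f = df.filter($"type" === "SEARCH").stat.bloomFilter("user_id", 1000L, 0.01)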
Filter with a udf:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable, you should be able to use it the same way, but if the data is large enough to justify a Bloom filter in the first place, you should avoid collecting it to the driver.
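As a rough sketch of 'the same way' with the collect-based filter from the question (this assumes that filter class is serializable; Spark's own org.apache.spark.util.sketch.BloomFilter is):
// Reuse the bloomFilter already populated from the collected searcher IDs
// and wrap the membership test in a udf so it can be applied to a Column.
val mightContainString = udf((s: String) => bloomFilter.mightContainString(s))
df.filter(mightContainString($"user_id"))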
Answer 1 (score: -1)
You can do something like the following, though I should mention up front that this is not really a good idea:
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
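From here, a sketch of how broadcasting the filter and doing the actual filtering could look (the broadcast and the udf below are my own reconstruction, not code from this answer):
import org.apache.spark.sql.functions.udf
// Populate the filter on the driver, then broadcast it so executors can query it.
searchers.foreach(bloomFilter.putString(_))
val bfBroadcast = sc.broadcast(bloomFilter)
val mightContain = udf((s: String) => bfBroadcast.value.mightContainString(s))
df.filter(mightContain($"user_id"))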
If the bloomFilter instance is serializable, you can remove the broadcast.
Hope this helps, cheers.