How to pass a dataset column value to a function when using a Spark filter with Scala?

Asked: 2018-04-16 21:12:38

Tags: scala apache-spark bloom-filter

I have an array of actions consisting of a user ID and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+

I want to filter the actions so that only those belonging to users with at least one SEARCH action are kept.

So I created a Bloom filter of the user IDs that have a SEARCH action.

Then I tried to filter all actions based on whether the user is in the Bloom filter:

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
// does not compile: mightContainString expects a String, but $"user_id" is a Column
df.filter(bloomFilter.mightContainString($"user_id"))

But this code fails with an error:

type mismatch;
found   : org.apache.spark.sql.ColumnName
required: String

How can I pass the column value to the BloomFilter.mightContainString method?

2 Answers:

Answer 0 (score: 0)

Create the filter:

val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
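
Here expectedNumItems is the number of distinct values you expect in the column and fpp is the target false-positive probability. For instance, with purely illustrative values (the numbers below are placeholders, not taken from the question):

// illustrative sizing only: ~1 million distinct user IDs, 3% false-positive rate
val f = df.stat.bloomFilter("user_id", 1000000L, 0.03)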

Filter with a udf:

import org.apache.spark.sql.functions.udf

// wrap the driver-side filter in a udf so it can be applied to each row's user_id
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))

If your current Bloom filter implementation is serializable, you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting.
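
As a minimal sketch of that "same way", assuming the filter in the question is Spark's org.apache.spark.util.sketch.BloomFilter (which is serializable), the already collected searchers can be loaded into the filter on the driver and the filter then used from a udf, just like f above:

import org.apache.spark.sql.functions.udf
import org.apache.spark.util.sketch.BloomFilter

// build the filter on the driver from the collected user IDs
val bloomFilter = BloomFilter.create(searchers.length.toLong)
searchers.foreach(bloomFilter.putString(_))

// the serializable filter is captured by the udf closure and shipped to the executors
val mightContainString = udf((s: String) => bloomFilter.mightContainString(s))
df.filter(mightContainString($"user_id"))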

Answer 1 (score: -1)

You can do it like this. I should mention the fact that this is not a good idea, but you can do something along the following lines:


// assuming Spark's built-in org.apache.spark.util.sketch.BloomFilter, which matches the API used above
import org.apache.spark.util.sketch.BloomFilter

val sparkSession = ???
val sc = sparkSession.sparkContext

val bloomFilter = BloomFilter.create(100)

val df = ???

// collect the distinct user IDs that have at least one SEARCH action to the driver
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect

If the bloomFilter instance is serializable, you can remove the broadcast.
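
The broadcast referred to here is not shown in the snippet above; a rough sketch of how the remaining steps might look (the bcBloomFilter and isSearcher names are illustrative, not from the original answer):

import org.apache.spark.sql.functions.udf

// fill the filter on the driver, then broadcast it explicitly to the executors
searchers.foreach(bloomFilter.putString(_))
val bcBloomFilter = sc.broadcast(bloomFilter)

// look up each user_id in the broadcast copy of the filter
val isSearcher = udf((s: String) => bcBloomFilter.value.mightContainString(s))
df.filter(isSearcher($"user_id"))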

Hope this helps, cheers.