Question

I want to search the contents of a Spark DataFrame with 'like'. We can combine conditions with the 'or' function, much like '||' in SQL, to filter like this:

voc_0201.filter(
  col("contents").like("intel").or(col("contents").like("apple"))
).count

But I have to filter on many strings. How can I filter a DataFrame by a list or array of Strings?

Thanks

Answer 1

Let's first define our patterns:

val patterns = Seq("foo", "bar")

and create an example DataFrame:

val df = Seq((1, "bar"), (2, "foo"), (3, "xyz")).toDF("id", "contents")

A simple solution is to fold over the patterns:

val expr = patterns.foldLeft(lit(false))((acc, x) => 
  acc || col("contents").like(x)
)

df.where(expr).show

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+
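Note that like without wildcards only matches when the whole value equals the pattern. If the goal is to find rows that contain any of the strings, a minimal sketch is to wrap each pattern in %. The local SparkSession setup and the sample rows below are assumptions for illustration; in spark-shell, spark already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("like-any").getOrCreate()
import spark.implicits._

val patterns = Seq("intel", "apple")
val df = Seq((1, "intel inside"), (2, "an apple a day"), (3, "xyz")).toDF("id", "contents")

// Wrap each pattern in % so that like matches it anywhere in the string,
// then OR the conditions together, starting from a literal false.
val containsAny = patterns.foldLeft(lit(false)) { (acc, p) =>
  acc || col("contents").like(s"%$p%")
}

df.where(containsAny).show()
```

Any % or _ inside a pattern is still treated as a LIKE wildcard, so this sketch assumes the search strings contain neither.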

Another is to build a regular expression and use rlike:

val expr = patterns.map(p => s"^$p$$").mkString("|")
df.where(col("contents").rlike(expr)).show

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+

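PS: the regex solution above may not work if the patterns are not simple literals, for example if they contain regex metacharacters.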
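One way to handle patterns that contain regex metacharacters is to quote each one before joining them. The helper name literalAlternation below is hypothetical, and the sketch assumes every pattern should be matched as a literal, whole value; it relies on rlike using Java regular expressions.

```scala
import java.util.regex.Pattern

// Hypothetical helper: builds one anchored alternation in which every
// input string is quoted, so characters such as . or + lose their regex meaning.
def literalAlternation(literals: Seq[String]): String =
  literals.map(p => Pattern.quote(p)).mkString("^(?:", "|", ")$")

val regex = literalAlternation(Seq("a.b", "c+d"))

// The dot is matched literally: "a.b" matches, "axb" does not.
println("a.b".matches(regex))
println("axb".matches(regex))

// With Spark, the result would be passed to rlike in the same way as before:
// df.where(col("contents").rlike(regex)).show
```

An empty list would produce a regex that matches only the empty string, so it is worth guarding against that case before filtering.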

Finally, for simple patterns you can use isin:

df.where(col("contents").isin(patterns: _*)).show

// +---+--------+ 
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+

It is also possible to join:

val patternsDF = patterns.map(Tuple1(_)).toDF("contents")
df.join(broadcast(patternsDF), Seq("contents")).show

// +---+--------+ 
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+
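An inner join repeats a row once per matching pattern, so a duplicated entry in the pattern list duplicates rows in the result. A possible variant, sketched here under the assumption that only the matching rows of df are wanted, is a left semi join. The local SparkSession setup is an assumption for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("semi-join").getOrCreate()
import spark.implicits._

val patterns = Seq("foo", "bar", "foo") // note the duplicated "foo"
val df = Seq((1, "bar"), (2, "foo"), (3, "xyz")).toDF("id", "contents")
val patternsDF = patterns.toDF("contents")

// left_semi keeps each row of df at most once and returns only df's columns,
// no matter how many times its value appears in patternsDF.
val matched = df.join(broadcast(patternsDF), Seq("contents"), "left_semi")

matched.show()
```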

Apache Spark SQL DataFrame: filtering by multiple string rules
