How to remove unwanted rows from a DataFrame based on a list of items

Date: 2019-06-12 15:44:12

Tags: scala list apache-spark dataframe filter

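I have a DataFrame of complete IP addresses and a list of IP addresses that I want to remove from it. After removing every address in "lista", I want to end up with a new DataFrame, "filtered_list".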

I saw an example in "How to use NOT IN clause in filter condition in spark". However, I can't get it to work even before applying the "not" to the filter. Please help.

Example:

var df = Seq("119.73.148.227", "42.61.124.218", "42.61.66.174", "118.201.94.2","118.201.149.146", "119.73.234.82", "42.61.110.239", "58.185.72.118", "115.42.231.178").toDF("ipAddress")

var lista = List("119.73.148.227", "118.201.94.2")

var filtered_list = df.filter(col("ipAddress").isin(lista))

I get the following error message:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(119.73.148.227, 118.201.94.2)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:113)
  at org.apache.spark.sql.functions$.lit(functions.scala:96)
  at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:787)
  at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:787)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.Column.isin(Column.scala:787)
  ... 52 elided

2 Answers:

Answer 0 (score: 2)

You can use the except method on the DataFrame. Turn the list into a DataFrame with the same column name first:

var df = Seq("119.73.148.227", "42.61.124.218", "42.61.66.174", "118.201.94.2","118.201.149.146", "119.73.234.82", "42.61.110.239", "58.185.72.118", "115.42.231.178").toDF("ipAddress")

var lista = Seq("119.73.148.227", "118.201.94.2").toDF("ipAddress")

var onlyWantedIp = df.except(lista)
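One thing to keep in mind: except behaves like SQL's EXCEPT DISTINCT, so it also collapses duplicate rows in df. If repeated IP addresses need to survive, a left anti join is one possible alternative. The sketch below is only an illustration under assumptions: the object name ExceptVsAntiJoin, the helper rowCounts, the local SparkSession settings, and the duplicated sample row are not part of the original question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: compares except with a left anti join on a small sample.
object ExceptVsAntiJoin {

  // Returns (rows kept by except, rows kept by the left anti join).
  def rowCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Sample data with one IP deliberately repeated (an assumption for illustration).
    val df = Seq("119.73.148.227", "42.61.124.218", "42.61.124.218", "118.201.94.2")
      .toDF("ipAddress")
    val unwanted = Seq("119.73.148.227", "118.201.94.2").toDF("ipAddress")

    // Set difference: unwanted rows are removed and duplicates are collapsed.
    val viaExcept = df.except(unwanted)

    // Left anti join: keeps every row of df that has no match in unwanted,
    // including repeated rows.
    val viaAntiJoin = df.join(unwanted, Seq("ipAddress"), "left_anti")

    (viaExcept.count(), viaAntiJoin.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("except-vs-anti-join")
      .getOrCreate()
    val (exceptRows, antiJoinRows) = rowCounts(spark)
    println(s"except: $exceptRows row(s), left_anti: $antiJoinRows row(s)")
    spark.stop()
  }
}
```

With this sample, except keeps a single "42.61.124.218" row while the anti join keeps both copies. For the data in the question, where every address appears once, the two approaches give the same rows.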

Answer 1 (score: 1)

isin takes varargs, not a List. You have to use the : _* type ascription to expand the list into separate elements:

var filtered_list = df.filter(col("ipAddress").isin(lista: _*))
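To see why the original call failed, here is a minimal plain-Scala sketch of how a varargs parameter treats a List with and without : _*. The method countArgs is a hypothetical stand-in with the same parameter shape as Column.isin(list: Any*); it is not part of Spark.

```scala
// Hypothetical sketch: countArgs mimics the varargs shape of Column.isin(list: Any*).
object VarargsSketch {

  // Reports how many separate arguments the method actually received.
  def countArgs(values: Any*): Int = values.length

  def main(args: Array[String]): Unit = {
    val lista = List("119.73.148.227", "118.201.94.2")

    // Without the ascription the whole List is passed as ONE argument,
    // which is what Spark then tries (and fails) to turn into a literal.
    println(countArgs(lista))

    // With : _* each element of the List becomes its own argument.
    println(countArgs(lista: _*))
  }
}
```

Note that isin on its own keeps the rows whose address is in the list. To remove those addresses, as the question asks, negate the condition: df.filter(!col("ipAddress").isin(lista: _*)).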