Filter not accepting integers? Spark DataFrame

Time: 2019-10-17 08:00:27

Tags: scala dataframe apache-spark

I am working with the Yelp dataset using Spark DataFrames, and I am having trouble with filter().

It seems I cannot specify integers, only strings?

Here is my code:

def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
    yelpBusinesses.select("name", "stars", "review_count").filter("stars" == 5, "review_count" >= 1000)
  }

Here is one row from the Yelp dataset:

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}

Clearly, both stars and review_count are numbers, not strings.

My function should return a DataFrame with the name, stars, and review_count of every business that has 5 stars and a review_count of at least 1000.

3 Answers

Answer 0 (score: 1)

Try casting to int:

    import spark.implicits._
    def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
      yelpBusinesses.select('name, 'stars, 'review_count)
                    .filter('stars.cast("int") === 5 && 'review_count.cast("int") >= 1000)
    }

Answer 1 (score: 1)

Try this:

import spark.implicits._

def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
    yelpBusinesses.select("name", "stars", "review_count")
                  .filter($"stars" === 5 && $"review_count" >= 1000)
  }

Or like this:

import org.apache.spark.sql.functions._

def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
    yelpBusinesses.select("name", "stars", "review_count")
                  .filter(col("stars") === lit(5) && col("review_count") >= lit(1000))
  }

Answer 2 (score: 1)

I would try:

    import spark.implicits._
    def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
      yelpBusinesses.select("name", "stars", "review_count")
                    .filter($"stars" === 5 && $"review_count" >= 1000)
    }