I'm processing the Yelp dataset with Spark DataFrames and I'm having trouble with filter().
It seems I can't specify integers, only strings?
Here is my code:
def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
yelpBusinesses.select("name", "stars", "review_count").filter("stars" == 5, "review_count" >= 1000)
}
Here is one row from the Yelp dataset:
{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}
Clearly, stars and review_count are numbers, not strings.
My function's output should be a DataFrame with the name, stars, and review_count of all businesses that have 5 stars and a review_count greater than or equal to 1000.
Answer 0 (score: 1)
Try casting to int:
import spark.implicits._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
  yelpBusinesses.select('name, 'stars, 'review_count)
    .filter('stars.cast("int") === 5 && 'review_count.cast("int") >= 1000)
}
Answer 1 (score: 1)
Try this:
def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    .filter("stars = 5 AND review_count >= 1000")
}
Or something like this:
import org.apache.spark.sql.functions._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    .filter(col("stars") === lit(5) && col("review_count") >= lit(1000))
}
Answer 2 (score: 1)
I would try:
import spark.implicits._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    .filter($"stars" === 5 && $"review_count" >= 1000)
}
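The predicate all three answers express is the same: stars equal to 5 AND review_count at least 1000. As a minimal sketch of that logic on plain Scala collections (no Spark needed), where the Business case class and the second sample row are hypothetical and only the first row mirrors the JSON above:

```scala
// Hypothetical plain-Scala mirror of the DataFrame filter:
// keep rows where stars == 5 and reviewCount >= 1000.
case class Business(name: String, stars: Double, reviewCount: Int)

val businesses = Seq(
  Business("Arizona Biltmore Golf Club", 3.0, 5), // from the sample JSON row
  Business("Hypothetical Diner", 5.0, 1200)       // made-up 5-star example
)

// Same conjunction the Spark answers build with === and &&
val fiveStar = businesses.filter(b => b.stars == 5.0 && b.reviewCount >= 1000)
```

Note that in the JSON, stars is a double (3.0), so comparing against the integer 5 relies on numeric promotion; in Spark the === comparison handles this the same way.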