When trying to use the subtract method in Spark Scala, I get the following error:
<console>:29: error: value subtract is not a member of org.apache.spark.sql.DataFrame
But from the following links I can see that it exists in Python:
https://forums.databricks.com/questions/7505/comparing-two-dataframes.html https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.subtract
Do we have a subtract method in Spark Scala? If not, what is the alternative?
My sample code is shown below:
scala> val myDf1 = sc.parallelize(Seq(1,2,2)).toDF
myDf1: org.apache.spark.sql.DataFrame = [value: int]
scala> val myDf2 = sc.parallelize(Seq(1,2)).toDF
myDf2: org.apache.spark.sql.DataFrame = [value: int]
scala> val result = myDf1.subtract(myDf2)
<console>:28: error: value subtract is not a member of org.apache.spark.sql.DataFrame
val result = myDf1.subtract(myDf2)
Answer 0 (score: 1)
That's because subtract
does not exist on DataFrame. To be honest, I'm not sure what you're trying to do:
scala> val df1 = sc.parallelize(Seq(1,2,2)).toDF
df1: org.apache.spark.sql.DataFrame = [value: int]
scala> val df2 = sc.parallelize(Seq(1,2)).toDF
df2: org.apache.spark.sql.DataFrame = [value: int]
scala> df1.except(df2).show
+-----+
|value|
+-----+
+-----+
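Note that except uses set semantics, so duplicate rows are collapsed before the difference is taken, which is why the result above is empty. If you want multiset behavior instead (remove one matching occurrence per row, keeping the extra duplicate), newer Spark versions (2.4+) add exceptAll; a minimal sketch, assuming such a session, so this is not available on the 1.x REPL shown above:

scala> df1.exceptAll(df2).show
+-----+
|value|
+-----+
|    2|
+-----+

Here the extra 2 in df1 survives because exceptAll only removes as many occurrences as appear in df2.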
But it seems you want to find the duplicates rather than remove them.
From start to finish:
scala> val dupes = df1.groupBy("value").count.filter("count > 1").drop("count")
dupes: org.apache.spark.sql.DataFrame = [value: int]
scala> dupes.show()
+-----+
|value|
+-----+
| 2|
+-----+
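The same duplicate check can also be written in plain SQL, which some find easier to read; a sketch assuming a Spark 2.x session where the SparkSession is bound to spark (not the 1.x shell used above):

scala> df1.createOrReplaceTempView("t")

scala> spark.sql("SELECT value FROM t GROUP BY value HAVING count(*) > 1").show
+-----+
|value|
+-----+
|    2|
+-----+

The HAVING count(*) > 1 clause plays the same role as the filter("count > 1") step in the DataFrame version.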