How to check the intersection of two DataFrame columns in Spark

Asked: 2017-05-24 21:00:38

Tags: apache-spark pyspark sparkr

Using pyspark or sparkr (ideally both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:

newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)

# intersect works on whole DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)

    name  surname
1 George Williams

# intersect does not work on single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)

   Error in as.vector(y) : no method for coercing this S4 class to a vector

How can I get intersect to work on single columns?

1 Answer:

Answer 0 (score: 5)

You need two Spark DataFrames to use the intersect function. Use the select function to project each DataFrame down to the column you want, then intersect the resulting single-column DataFrames.

In SparkR:

newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF, 'name'))

In pyspark:

newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name'))
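
For completeness, here is a minimal end-to-end pyspark sketch built from the same data as the question. The SparkSession variable `spark` and the app name are assumptions for illustration, not part of the original post:

# A minimal self-contained sketch, assuming a local SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intersect-example").getOrCreate()

# Recreate the question's data as PySpark DataFrames.
newHiresDF = spark.createDataFrame(
    [("Thomas", "Smith"), ("George", "Williams"),
     ("George", "Brown"), ("John", "Taylor")],
    ["name", "surname"])
salesTeamDF = spark.createDataFrame(
    [("Lucas", "Martin"), ("Bill", "Clark"), ("George", "Williams")],
    ["name", "surname"])

# Project each DataFrame down to the column of interest, then intersect.
newSalesHire = newHiresDF.select("name").intersect(salesTeamDF.select("name"))
newSalesHire.show()
# +------+
# |  name|
# +------+
# |George|
# +------+

Note that intersect returns distinct rows (INTERSECT DISTINCT semantics), so even though "George" appears twice in newHiresDF, it appears only once in the result.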