Subset a SparkR DataFrame based on column values that match another DataFrame's column values

Date: 2017-06-13 18:38:02

Tags: apache-spark pyspark spark-dataframe sparkr

I have two SparkR DataFrames, newHiresDF and salesTeamDF. I want to subset newHiresDF based on the values of salesTeamDF$name, but I can't find a way to do it. Below is the code I tried.

# Set up SparkR (in a Databricks notebook this is already done for you)
library(SparkR)
sparkR.session()

# Create DataFrames
newHires <- data.frame(name = c("Thomas", "George", "Bill", "John"),
    surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Thomas", "Bill", "George"),
    surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
display(newHiresDF)  # display() is Databricks-specific; use showDF() elsewhere

# Attempts to subset newHiresDF based on name values in salesTeamDF.
# All of the below result in errors: SparkR columns can't be used with
# base R's %in%, and filter() expects a Column condition, not a DataFrame.
NHsubset1 <- filter(newHiresDF, newHiresDF$name %in% salesTeamDF$name)
NHsubset2 <- filter(newHiresDF, intersect(select(newHiresDF, 'name'),
    select(salesTeamDF, 'name')))
NHsubset3 <- newHiresDF[newHiresDF$name %in% salesTeamDF$name, ]  # how it would be done in base R

#What I'd like NHsubset to look like:
    name  surname
1 Thomas    Smith
2 George Williams
3   Bill    Brown
PySpark code would also work, if you prefer.
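For reference, the membership filter the question is reaching for works directly on local (non-Spark) data. A minimal sketch of the same logic in pandas, using the post's example values (pandas is my assumption here, not part of the original question):

```python
import pandas as pd

# Local data mirroring the post's example
newHires = pd.DataFrame({"name": ["Thomas", "George", "Bill", "John"],
                         "surname": ["Smith", "Williams", "Brown", "Taylor"]})
salesTeam = pd.DataFrame({"name": ["Thomas", "Bill", "George"],
                          "surname": ["Martin", "Clark", "Williams"]})

# Keep only the new hires whose name appears in the sales team
NHsubset = newHires[newHires["name"].isin(salesTeam["name"])]
```

This is the semantics the `%in%` attempts above are after; the difficulty in Spark is that the two columns live in distributed DataFrames, so a join is needed instead.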

1 Answer:

Answer 0 (score: 0)

A solution that looks simple in hindsight: just use merge. By default, merge joins on the columns the two DataFrames have in common (here, name), so merging against only the name column of salesTeamDF keeps exactly the matching rows of newHiresDF.

NHsubset <- merge(newHiresDF, select(salesTeamDF, 'name'))
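The same join can be sketched outside Spark with pandas (pandas is an assumption on my part; the data mirrors the post's example). Selecting only the `name` column before merging plays the role of `select(salesTeamDF, 'name')` above, so the conflicting surname column is never pulled in:

```python
import pandas as pd

newHires = pd.DataFrame({"name": ["Thomas", "George", "Bill", "John"],
                         "surname": ["Smith", "Williams", "Brown", "Taylor"]})
salesTeam = pd.DataFrame({"name": ["Thomas", "Bill", "George"],
                          "surname": ["Martin", "Clark", "Williams"]})

# Inner join on the shared 'name' column; "John" has no match and is dropped
NHsubset = pd.merge(newHires, salesTeam[["name"]])
```

In PySpark, the analogous operation would be a left-semi join, e.g. `newHiresDF.join(salesTeamDF.select("name"), "name", "leftsemi")`.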