The dataset can be downloaded from here.
library(dplyr)
NBA <- read.csv("NBA Season Dataset/Seasons_Stats.csv")
NBA$Player <- as.character(NBA$Player)
PlayerData <- read.csv("NBA Season Dataset/player_data.csv")
PlayerData$name <- as.character(PlayerData$name)
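I want to get each player's height and weight from PlayerData and then merge them into the main data frame, NBA. The problem is that this NBA dataset contains players who share a name with other players, so I need to tell them apart before merging the two data frames by player name with merge.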
PlayerData[duplicated(PlayerData$name), "name"]
gives me 50 duplicated names.
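For reference, duplicated() flags only the second and later occurrences of a value, so the call above lists each repeat but not the first row carrying that name. A minimal sketch of how to see every row that shares a name, using a made-up toy data frame (`toy` and the "Some Player" row are invented for illustration; only the Dee Brown years come from the calls below):

```r
# Toy data standing in for PlayerData; values are for illustration only
toy <- data.frame(
  name       = c("Dee Brown", "Dee Brown", "Some Player"),
  year_start = c(1991, 2007, 2000),
  year_end   = c(2002, 2009, 2005),
  stringsAsFactors = FALSE
)

# duplicated() marks only the second and later occurrences of each name
dup_names <- unique(toy$name[duplicated(toy$name)])

# All rows whose name occurs more than once, including the first occurrence
all_dups <- toy[toy$name %in% dup_names, ]
print(all_dups)
```

Looking at all rows for a duplicated name shows the year_start and year_end values needed to tell the players apart.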
So I wrote a function that renames a player in both data frames based on the years in which the player was active:
unduplicate <- function(name, year_start, year_end, new_name) {
PlayerData[PlayerData$name == name & PlayerData$year_start == year_start & PlayerData$year_end == year_end, 1] = new_name
NBA[NBA$Player == name & NBA$Year <= year_end & NBA$Year >= year_start, "Player"] = new_name
}
Then I call the function:
unduplicate("Dee Brown", 1991, 2002, "Dee Brown 1")
unduplicate("Dee Brown", 2007, 2009, "Dee Brown 2")
Nothing changes...
However, it works if I do it manually:
PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 1991 & PlayerData$year_end == 2002, 1] = "Dee Brown 1"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2002 & NBA$Year >= 1991, "Player"] = "Dee Brown 1"
PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 2007 & PlayerData$year_end == 2009, 1] = "Dee Brown 2"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2009 & NBA$Year >= 2007, "Player"] = "Dee Brown 2"
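So my questions are:
1) What is wrong with the function? I checked it and tried many variations, but nothing worked.
2) Is there a better way to approach this problem?
I'm quite new to this, so please forgive me if this is just a silly beginner's mistake.
Thanks!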
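Answer 0 (score: 1)
You can use distinct() from dplyr to select the unique players based on a set of variables. The sqldf library then lets you join tables on conditions that include inequalities: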
library(dplyr)
player_data <- read.csv("player_data.csv", stringsAsFactors = FALSE)
Players <- read.csv("Players.csv", stringsAsFactors = FALSE)
NBA1 <- read.csv("Seasons_Stats.csv", stringsAsFactors = FALSE)
Dist_players <- player_data %>%
  distinct(name, year_start, year_end, height, weight)
library(sqldf)
Final <- sqldf("SELECT * FROM NBA1 JOIN Dist_players ON NBA1.Player = Dist_players.name
WHERE NBA1.Year >= Dist_players.year_start AND NBA1.Year <= Dist_players.year_end")
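On question 1, a likely explanation is R's copy semantics: assigning to PlayerData or NBA inside a function changes local copies, which are discarded when the function returns, so the global data frames stay as they were. Below is a minimal sketch of one way around that, assuming the data frames are passed in and the modified copies are returned. The helper name `rename_player` and the toy rows are hypothetical; only the Dee Brown names and year ranges come from the question.

```r
# Toy stand-ins for the two data frames; values are for illustration only
PlayerData <- data.frame(
  name       = c("Dee Brown", "Dee Brown"),
  year_start = c(1991, 2007),
  year_end   = c(2002, 2009),
  stringsAsFactors = FALSE
)
NBA <- data.frame(
  Player = c("Dee Brown", "Dee Brown"),
  Year   = c(1995, 2008),
  stringsAsFactors = FALSE
)

# Hypothetical helper: edits its local copies of both data frames
# and returns them in a list, so the caller can keep the result
rename_player <- function(players, seasons, name, year_start, year_end, new_name) {
  in_players <- players$name == name &
    players$year_start == year_start &
    players$year_end == year_end
  in_seasons <- seasons$Player == name &
    seasons$Year >= year_start &
    seasons$Year <= year_end
  players[in_players, "name"] <- new_name
  seasons[in_seasons, "Player"] <- new_name
  list(players = players, seasons = seasons)
}

res <- rename_player(PlayerData, NBA, "Dee Brown", 1991, 2002, "Dee Brown 1")
res <- rename_player(res$players, res$seasons, "Dee Brown", 2007, 2009, "Dee Brown 2")

# Assign the returned copies back so the change is visible to the caller
PlayerData <- res$players
NBA <- res$seasons
```

Replacing `=` with `<<-` inside the original function would also modify the globals, but returning the modified data frames keeps the function free of side effects. With the range join above, the renaming step may not be needed at all.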