Merging data frames creates duplicate rows

Asked: 2018-09-06 04:52:54

Tags: r merge screen-scraping dbplyr

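I am writing a script that scrapes MLB Gameday data and writes it to an Excel document. The problem I am running into is with the merge: it creates extra rows, and most of them appear to be duplicates. I cannot figure out why this happens or how to prevent it. The expected output has 1313 rows. What do I need to do to fix this?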
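The row growth described above can be reproduced on toy data. The sketch below uses hypothetical `hits` and `players` data frames (not the real scraped tables) and assumes the player table holds one row per player per game, which its `gameday_link` column suggests. Under that assumption, `merge()` on the player id alone returns one row for every matching pair, so a player who appears in two games doubles each of their hits.

```r
# Toy data: hypothetical stand-ins for the hit and player tables.
hits <- data.frame(
  batter = c(1, 2, 3),
  gameday_link = c("g1", "g1", "g2"),
  x = c(10, 20, 30)
)

# Player 1 appears once per game, so id alone is not a unique key.
players <- data.frame(
  id = c(1, 1, 2, 3),
  gameday_link = c("g1", "g2", "g1", "g2"),
  last = c("Adams", "Adams", "Baker", "Cruz")
)

# Joining on id alone: the hit by batter 1 matches two player rows.
by_id <- merge(hits, players, by.x = "batter", by.y = "id")
nrow(by_id)

# Joining on id and game: each hit matches at most one player row.
by_id_game <- merge(hits, players,
                    by.x = c("batter", "gameday_link"),
                    by.y = c("id", "gameday_link"))
nrow(by_id_game)
```

Here `by_id` has four rows for three hits, while `by_id_game` has three. Whether this is the cause in the real data depends on what the scraped tables actually contain.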

library(dplyr)
library(dbplyr)
library(pitchRx)
library(RSQLite)
library(XML2R)
library(ggplot2)

files <- c("inning/inning_hit.xml", "players.xml", "miniscoreboard.xml")
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

#Scrape MLB gameday
scrape(start = "2016-08-08", end = "2016-08-09", connect = my_db, suffix = files)

#Create locations data frame and fill with hit coordinates
locations <- select(tbl(my_db, "hip"), des, x, y, batter, pitcher, type, team, inning)


#Rename ids
names(locations)[names(locations) == 'batter'] <- 'batter.id'
names(locations)[names(locations) == 'pitcher'] <- 'pitcher.id'

#Strip the leading "gid_" from gameday_link
dbGetQuery(my_db, 'UPDATE player SET gameday_link = trim(gameday_link, "gid_")')


#create batters, pitchers and stadium dataframe
batters <- select(tbl(my_db, "player"), first, last, id, bats, team_abbrev, rl, gameday_link)

pitchers <- select(tbl(my_db, "player"), first, last, id)  

stadium <- select(tbl(my_db, "game"), original_date, home_team_name, gameday_link)  


#merge dataframes together
merge <- merge(locations, batters, by.x="batter", by.y="id", all.y=F)

merge2 <- merge(merge, pitchers, by.x="batter", by.y="id", all.x=F)

merge3 <- merge(merge2, stadium, by.x="gameday_link", by.y= "gameday_link", all.x=F)

merge3 <- merge3[!duplicated(merge3[c("x","y"),]),]

write.csv(merge3, file = "MyFileName.csv")
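Two checks may help narrow this down; both are sketches on hypothetical data rather than the scraped tables. First, in base R `df[c("x", "y"), ]` selects rows named "x" and "y", whereas `df[, c("x", "y")]` selects the columns, so only the second form lets `duplicated()` compare coordinates. Second, counting repeated ids in a lookup table before merging shows whether the key is unique.

```r
# Hypothetical merged result with one repeated hit location.
merged <- data.frame(
  x = c(10, 10, 20),
  y = c(5, 5, 7),
  last = c("Adams", "Adams", "Baker")
)

# Row indexing: no rows are named "x" or "y", so this yields two NA rows.
by_rows <- merged[c("x", "y"), ]

# Column indexing: duplicated() now compares the (x, y) pairs.
by_cols <- merged[, c("x", "y")]
deduped <- merged[!duplicated(by_cols), ]

# Key check: ids that occur more than once in a hypothetical lookup table.
players <- data.frame(id = c(1, 1, 2))
dup_ids <- unique(players$id[duplicated(players$id)])
```

Deduplicating on coordinates alone would also drop two genuinely different hits that landed on the same spot, so making the join keys unique is probably the safer fix.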

0 answers:

No answers yet