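I need some help transforming my data set, and I would appreciate any help or feedback.
I have college football score data for the past 50 years. I currently have a data frame like the one in figure 1, and I need to get to a data frame like the one in figure 2. The target data frame needs a linked list of all the teams for each year, plus two columns that track wins and losses respectively. The linked list must be specific to each year. So basically a data frame like figure 2, but with data for every year.
Here is the code that produces the cleaned data frame, similar to the one in my first figure.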
# Make generic data frame and get data
practice = data.frame('a'=character(), 'b'=character(), 'c'= numeric(), 'd'=character(), 'e'= numeric(), 'f'=character())
widths = c(10, 28, 5, 28, 3, 19)
years = 1960:2010
for (i in years){
  football_page = paste('http://homepages.cae.wisc.edu/~dwilson/rsfc/history/howell/cf', i, 'gms.txt',sep = '')
  get_data = read.fwf(football_page, widths)
  practice = rbind(practice, get_data)
}
heading = list('DATE', 'AWAY TEAM', 'AWAY SCORE', 'HOME TEAM', 'HOME SCORE', 'LOCATION')
colnames(practice) = heading
# Fixing season dates: take the year from the DATE column
practice = cbind('SEASON'=numeric(nrow(practice)),practice)
fix_date = matrix(0, nrow = nrow(practice))
for (j in 1:nrow(fix_date)){
  fix_date[j,1] = substr(practice[j,2],7,10)
}
fix_date = as.numeric(fix_date)
practice$SEASON = fix_date
# Games played in January belong to the previous season
for (j in 1:nrow(practice)){
  if (grepl('01/.......', practice[j,2]))
    practice[j,1] = practice[j,1]-1
}
# Fix names: strip the spaces from the team names
practice[,3]=gsub(' ','',practice[,3])
practice[,5]=gsub(' ','',practice[,5])
# Drop the LOCATION and DATE columns
practice = practice[, -7]
practice = practice[, -2]
The data set is called practice.
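As an aside, the two loops in the season fix can be replaced by vectorized calls. This is a minimal sketch, assuming the DATE values look like "MM/DD/YYYY" (which is what the substr(…, 7, 10) call and the '01/' pattern above imply). The toy dates and the names dates, season and is_january are invented for illustration.

```r
# Toy dates, invented for illustration; assumed format is MM/DD/YYYY
dates <- c("09/17/1960", "11/26/1960", "01/02/1961")

# The calendar year is characters 7-10 of the date string
season <- as.numeric(substr(dates, 7, 10))

# Games played in January belong to the previous season
is_january <- substr(dates, 1, 2) == "01"
season[is_january] <- season[is_january] - 1

season
```

With the real data, dates would be the DATE column of practice before it is dropped.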
Answer 0 (score: 0)
Without a sample of your data or something like it, I can't fully test this, but I think it will do the job.
# Create a function to get win and loss counts by season for a single team
teamsum <- function(teamname) {
  require(dplyr)
  df <- practice %>%
    # Reduce the data set to games involving a single team
    filter(`AWAY TEAM` == teamname | `HOME TEAM` == teamname) %>%
    # Create a 0/1 indicator for whether or not that team won each of those games. Note
    # that ties will get treated as losses here; you could change that with a more
    # complicated set of if/else statements
    mutate(team = teamname,
           win = ifelse((`AWAY TEAM` == teamname & `AWAY SCORE` > `HOME SCORE`) |
                        (`HOME TEAM` == teamname & `HOME SCORE` > `AWAY SCORE`), 1, 0)) %>%
    # Group the data by season for the summing to follow
    group_by(SEASON) %>%
    # Reduce the data to a table with counts of wins and losses by season
    summarise(wins = sum(win),
              losses = n() - sum(win)) %>%
    # Add the team name as an id column to that summary table. In dplyr piping, '.' is
    # the object created by the preceding step in the pipeline; here, that summary
    # table of wins and losses.
    cbind(team = rep(teamname, nrow(.)), .)
  return(df)
}
# Apply that function to a vector of unique team names to make a list with
# tables of win & loss counts by season for each team in the original data.
# This version assumes that every team was the home team at least once.
teamlist <- lapply(unique(practice[, "HOME TEAM"]), teamsum)
# Stack the elements of that list into a single data frame, one row per team
# and season. rbind is used because merge() would join on the shared columns
# instead of appending rows.
df <- do.call(rbind, teamlist)
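The comment inside teamsum notes that ties are counted as losses and that a more involved if/else would change that. Here is a minimal base R sketch of that idea, assuming the column names from the question. The toy games and the names games, is_away, own, other, outcome and counts are invented for illustration.

```r
# Toy games, invented for illustration; column names follow the question's data
games <- data.frame(
  SEASON       = c(1960, 1960, 1960),
  `AWAY TEAM`  = c("A", "B", "A"),
  `AWAY SCORE` = c(7, 14, 10),
  `HOME TEAM`  = c("B", "A", "C"),
  `HOME SCORE` = c(3, 14, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

teamname <- "A"

# Score of the team of interest and of its opponent in each game
is_away <- games[["AWAY TEAM"]] == teamname
own     <- ifelse(is_away, games[["AWAY SCORE"]], games[["HOME SCORE"]])
other   <- ifelse(is_away, games[["HOME SCORE"]], games[["AWAY SCORE"]])

# Nested ifelse gives three outcomes instead of a 0/1 win indicator
outcome <- ifelse(own > other, "win",
                  ifelse(own < other, "loss", "tie"))

# Count each outcome; fixed levels keep zero counts in the table
counts <- table(factor(outcome, levels = c("win", "loss", "tie")))
counts
```

The same nested ifelse could replace the win indicator inside mutate(), with summarise() then counting each outcome per season.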
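Answer 1 (score: 0)
Another dplyr answer.
I used your code to get the data set, then duplicated the team columns to use as the key for reshaping the data set. You can use the same concept to achieve this in base R.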
library(dplyr)
library(tidyr)
practice_2 <- practice %>%
  mutate(home = `HOME TEAM`,
         away = `AWAY TEAM`) %>%
  # transform dataset to long format with `tidyr::gather()`
  gather(LOC, TEAM, 6:7) %>%
  group_by(SEASON, TEAM) %>%
  mutate(won = ifelse(LOC == "home",
                      as.numeric(`HOME SCORE` > `AWAY SCORE`),
                      as.numeric(`AWAY SCORE` > `HOME SCORE`)),
         lost = ifelse(LOC == "home",
                       as.numeric(`HOME SCORE` <= `AWAY SCORE`),
                       as.numeric(`AWAY SCORE` <= `HOME SCORE`)),
         op = ifelse(LOC == "home", `AWAY TEAM`, `HOME TEAM`)) %>%
  summarise(WINS = sum(won, na.rm = TRUE),
            LOSSES = sum(lost, na.rm = TRUE),
            OPPONENTS = list(unique(op)))
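For the base R route mentioned above, one possible sketch of the same concept is to stack a home view and an away view of every game, then aggregate by season and team. This assumes the column names from the question. The toy games and the names games, long, records, opponents and result are invented for illustration, and the opponents are collapsed into a comma-separated string rather than a list column. Ties count as losses, as in the dplyr version.

```r
# Toy games, invented for illustration; column names follow the question's data
games <- data.frame(
  SEASON       = c(1960, 1960, 1961),
  `AWAY TEAM`  = c("A", "B", "A"),
  `AWAY SCORE` = c(7, 14, 10),
  `HOME TEAM`  = c("B", "A", "C"),
  `HOME SCORE` = c(3, 21, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Long format: one row per team per game (home view stacked on away view)
long <- rbind(
  data.frame(SEASON   = games$SEASON,
             TEAM     = games[["HOME TEAM"]],
             OPPONENT = games[["AWAY TEAM"]],
             WON      = games[["HOME SCORE"]] > games[["AWAY SCORE"]],
             stringsAsFactors = FALSE),
  data.frame(SEASON   = games$SEASON,
             TEAM     = games[["AWAY TEAM"]],
             OPPONENT = games[["HOME TEAM"]],
             WON      = games[["AWAY SCORE"]] > games[["HOME SCORE"]],
             stringsAsFactors = FALSE)
)
long$WINS   <- as.integer(long$WON)
long$LOSSES <- as.integer(!long$WON)

# Sum wins and losses per season and team
records <- aggregate(long[c("WINS", "LOSSES")],
                     by = long[c("SEASON", "TEAM")], FUN = sum)

# Collect the distinct opponents per season and team as one string
opponents <- aggregate(long["OPPONENT"],
                       by = long[c("SEASON", "TEAM")],
                       FUN = function(x) paste(unique(x), collapse = ", "))
names(opponents)[names(opponents) == "OPPONENT"] <- "OPPONENTS"

# One row per season and team with WINS, LOSSES and OPPONENTS
result <- merge(records, opponents, by = c("SEASON", "TEAM"))
result
```

Unlike the merge across teams in a list, this merge joins two summaries that share the SEASON and TEAM keys, so each row keeps its own counts and opponents.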