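Transforming/reshaping a dataset: ranking college football teams over the past 50 years

Time: 2015-11-08 06:56:32

Tags: regex r transform

I need some help transforming my dataset. I would appreciate any help or feedback.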

What I currently have

What I'm trying to get

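I have college football score data for the past 50 years. I currently have a data frame like the one in picture 1, and I need to get a data frame similar to picture 2. The data frame I'm trying to get needs a concatenated list of all the teams for each year, plus two columns that track wins and losses respectively. The concatenated list has to be specific to each year. So basically a data frame like picture 2, but with data for every year.

This is the code that produces the cleaned data frame, similar to the one in the first picture.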

# Make generic data frame and get data

practice = data.frame('a'=character(), 'b'=character(), 'c'= numeric(), 'd'=character(), 'e'= numeric(), 'f'=character())
widths = c(10, 28, 5, 28, 3, 19)
years = 1960:2010
for (i in years){
  football_page = paste('http://homepages.cae.wisc.edu/~dwilson/rsfc/history/howell/cf', i, 'gms.txt',sep = '')
  get_data = read.fwf(football_page, widths)
  practice = rbind(practice, get_data)
}

heading = list('DATE', 'AWAY TEAM', 'AWAY SCORE', 'HOME TEAM', 'HOME SCORE', 'LOCATION')
colnames(practice) = heading


# Fixing season dates

practice = cbind('SEASON'=numeric(nrow(practice)),practice)
fix_date = matrix(0, nrow = nrow(practice))
for (j in 1:nrow(fix_date)){
  fix_date[j,1] = substr(practice[j,2],7,10)
}
fix_date = as.numeric(fix_date)
practice$SEASON = fix_date
for (j in 1:nrow(practice)){
  if (grepl('01/.......', practice[j,2]))
    practice[j,1] = practice[j,1]-1 
}


# Fix team names (strip spaces)

practice[,3]=gsub(' ','',practice[,3])
practice[,5]=gsub(' ','',practice[,5])


# Drop the location and date columns

practice = practice[, -7]
practice = practice[, -2]

The dataset is called practice.
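As a side note, the row-by-row season fix above can probably be written without loops. The following is a minimal sketch on made-up dates, assuming the DATE column always uses the MM/DD/YYYY layout that the `substr()` call in the question relies on; the variable names here are illustrative only.

```r
# Toy dates in the assumed MM/DD/YYYY layout (made-up values, not real games)
dates <- c("09/17/1960", "11/26/1960", "01/02/1961", "12/31/1961")

# The calendar year sits in characters 7-10, the month in characters 1-2
game_year  <- as.numeric(substr(dates, 7, 10))
game_month <- substr(dates, 1, 2)

# January games are assigned to the previous season, as in the loop above
season <- ifelse(game_month == "01", game_year - 1, game_year)
season
```

On the real data the same idea would apply to `practice$DATE` before that column is dropped.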

2 answers:

Answer 0: (score: 0)

Without a sample of your data or something similar I can't fully test this, but I think it will do the job.

# Create a function to get win and loss counts by season for a single team
teamsum <- function(teamname) {
  require(dplyr)
  df <- practice %>%
    # Reduce the data set to games involving a single team. Column names that
    # contain spaces must be wrapped in backticks.
    filter(`AWAY TEAM` == teamname | `HOME TEAM` == teamname) %>%
    # Create a 0/1 indicator for whether or not that team won each of those games. Note
    # that ties will get treated as losses here; you could change that with a more
    # complicated set of if/else statements
    mutate(team = teamname,
       win = ifelse((`AWAY TEAM` == teamname & `AWAY SCORE` > `HOME SCORE`) |
        (`HOME TEAM` == teamname & `HOME SCORE` > `AWAY SCORE`), 1, 0)) %>%
    # Group the data by season for the summing to follow
    group_by(SEASON) %>%
    # Reduce the data to a table with counts of wins and losses by season
    summarise(wins = sum(win),
      losses = n() - sum(win)) %>%
    # Add the team name as an id column to that summary table. In dplyr piping, '.' is
    # the object created by the preceding step in the pipeline -- here, that summary
    # table of wins and losses.
    cbind(team = rep(teamname, nrow(.)), .)
  return(df)
}

# Apply that function to a vector of unique team names to make a list with
# tables of win & loss counts by season for each team in the original data.
# This version assumes that every team was the home team at least once.
teamlist <- lapply(unique(practice[,"HOME TEAM"]), teamsum)

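# Stack the elements of that list into a single data frame with rbind. merge()
# would inner-join on the shared columns (team, SEASON, wins, losses) and
# return no rows, because no two tables share a team.
df <- do.call(rbind, teamlist)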
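The comment in the function above notes that ties are counted as losses. One possible way to track them separately is sketched below in base R on made-up games. The helper `team_record()` and the toy data are hypothetical, not part of the original answer; the column names are assumed to match those in `practice`.

```r
# Toy games (made-up teams and scores) with the column names used in `practice`
toy <- data.frame(
  SEASON       = c(1960, 1960, 1960, 1961),
  `AWAY TEAM`  = c("Alpha", "Beta", "Alpha", "Beta"),
  `AWAY SCORE` = c(14, 7, 10, 21),
  `HOME TEAM`  = c("Beta", "Alpha", "Beta", "Alpha"),
  `HOME SCORE` = c(7, 7, 20, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: wins, losses and ties by season for one team
team_record <- function(games, teamname) {
  g <- games[games[["AWAY TEAM"]] == teamname | games[["HOME TEAM"]] == teamname, ]
  is_home <- g[["HOME TEAM"]] == teamname
  # Score margin from this team's point of view
  margin <- ifelse(is_home,
                   g[["HOME SCORE"]] - g[["AWAY SCORE"]],
                   g[["AWAY SCORE"]] - g[["HOME SCORE"]])
  # sign() gives 1 for a win, -1 for a loss and 0 for a tie
  outcome <- factor(sign(margin), levels = c(1, -1, 0),
                    labels = c("wins", "losses", "ties"))
  # Cross-tabulate season by outcome, then turn the table into columns
  counts <- as.data.frame.matrix(table(g$SEASON, outcome))
  data.frame(team = rep(teamname, nrow(counts)),
             SEASON = as.numeric(rownames(counts)),
             counts, row.names = NULL)
}

alpha <- team_record(toy, "Alpha")
alpha
```

Using a factor with fixed levels keeps the `wins`, `losses` and `ties` columns present even for a team that never recorded one of those outcomes.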

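Answer 1: (score: 0)

Another dplyr answer

I used your code to get the dataset, then duplicated the team columns to use as the key for reshaping it. You can use the same concept to reach the same result in base R.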

library(dplyr)
library(tidyr)

practice_2 <- practice %>%
  mutate(home = `HOME TEAM`,
         away = `AWAY TEAM`) %>% 
  # transform dataset to long format with `tidyr::gather()`
  gather(LOC, TEAM, 6:7) %>% 
  group_by(SEASON, TEAM) %>%
  mutate(won  = ifelse(LOC == "home",
                      as.numeric(`HOME SCORE` >  `AWAY SCORE`),
                      as.numeric(`AWAY SCORE` >  `HOME SCORE`)),
         lost = ifelse(LOC == "home",
                      as.numeric(`HOME SCORE` <= `AWAY SCORE`),
                      as.numeric(`AWAY SCORE` <= `HOME SCORE`)),
         op = ifelse(LOC == "home", `AWAY TEAM`, `HOME TEAM`)) %>% 
  summarise(WINS   = sum(won, na.rm = TRUE),
            LOSSES = sum(lost, na.rm = TRUE),
            OPPONENTS = list(unique(op)))
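For the base R route mentioned above, the same idea (one row per team per game, then a summary by season and team) could look roughly like this. It is a sketch on made-up games rather than a tested replacement for the pipeline; the toy data and the intermediate names (`home`, `away`, `long`, `records`, `opps`) are assumptions, and ties are counted as losses to match the `<=` comparison in the dplyr code.

```r
# Toy games (made-up teams and scores) with the column names used in `practice`
toy <- data.frame(
  SEASON       = c(1960, 1960, 1961),
  `AWAY TEAM`  = c("Alpha", "Gamma", "Beta"),
  `AWAY SCORE` = c(14, 3, 21),
  `HOME TEAM`  = c("Beta", "Alpha", "Alpha"),
  `HOME SCORE` = c(7, 10, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Two views of every game: one from the home side, one from the away side
home <- data.frame(SEASON = toy$SEASON,
                   TEAM   = toy[["HOME TEAM"]],
                   OPP    = toy[["AWAY TEAM"]],
                   won    = as.numeric(toy[["HOME SCORE"]] > toy[["AWAY SCORE"]]),
                   stringsAsFactors = FALSE)
away <- data.frame(SEASON = toy$SEASON,
                   TEAM   = toy[["AWAY TEAM"]],
                   OPP    = toy[["HOME TEAM"]],
                   won    = as.numeric(toy[["AWAY SCORE"]] > toy[["HOME SCORE"]]),
                   stringsAsFactors = FALSE)

# Long format: one row per team per game
long <- rbind(home, away)
long$lost <- 1 - long$won  # ties count as losses here

# Wins and losses per season and team
records <- aggregate(cbind(WINS = won, LOSSES = lost) ~ SEASON + TEAM,
                     data = long, FUN = sum)

# Opponents per season and team, collapsed into one comma-separated string
opps <- aggregate(OPP ~ SEASON + TEAM, data = long,
                  FUN = function(x) paste(unique(x), collapse = ", "))
names(opps)[names(opps) == "OPP"] <- "OPPONENTS"

records <- merge(records, opps, by = c("SEASON", "TEAM"))
records <- records[order(records$SEASON, records$TEAM), ]
records
```

Collapsing the opponents into a string keeps the result a plain data frame; a list column, as in the dplyr version, would be an alternative if the opponents need further processing.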