Normalizing data with dplyr mutate() gives inconsistent results

Time: 2017-11-22 11:30:38

Tags: r dplyr mutate

I am trying to reproduce the framework from this blog post, http://www.luishusier.com/2017/09/28/balance/, with the code below, but my results look inconsistent:

library(tidyverse)
library(magrittr)

ids <- c("1617", "1516", "1415", "1314", "1213", "1112", "1011", "0910", "0809", "0708", "0607", "0506")

data <- ids %>% 
  map(function(i) {read_csv(paste0("http://www.football-data.co.uk/mmz4281/", i ,"/F1.csv")) %>% 
      select(Date:AST) %>%
      mutate(season = i)})

data <- bind_rows(data)

data <- data[complete.cases(data[ , 1:3]), ]

tmp1 <- data %>% 
  select(season, HomeTeam, FTHG:FTR,HS:AST) %>%
  rename(BP = FTHG,
         BC = FTAG,
         TP = HS,
         TC = AS,
         TCP = HST,
         TCC = AST,
         team = HomeTeam)%>%
  mutate(Pts = ifelse(FTR == "H", 3, ifelse(FTR == "A", 0, 1)), 
         Terrain = "Domicile")

tmp2 <- data %>% 
  select(season, AwayTeam, FTHG:FTR, HS:AST) %>%
  rename(BP = FTAG,
         BC = FTHG,
         TP = AS,
         TC = HS,
         TCP = AST,
         TCC = HST,
         team = AwayTeam)%>%
  mutate(Pts = ifelse(FTR == "A", 3 ,ifelse(FTR == "H", 0 , 1)),
         Terrain = "Extérieur")

tmp3 <- bind_rows(tmp1, tmp2)

l1_0517 <- tmp3 %>%
  group_by(season, team)%>%
  summarise(j = n(),
            pts = sum(Pts),
            diff_but = (sum(BP) - sum(BC)),
            diff_t_ca = (sum(TCP, na.rm = T) - sum(TCC, na.rm = T)),
            diff_t = (sum(TP, na.rm = T) - sum(TC, na.rm = T)), 
            but_p = sum(BP),
            but_c = sum(BC),
            tir_ca_p = sum(TCP, na.rm = T),
            tir_ca_c = sum(TCC, na.rm = T),
            tir_p = sum(TP, na.rm = T),
            tir_c = sum(TC, na.rm = T)) %>%
  arrange((season), desc(pts), desc(diff_but))

Then I apply the framework mentioned above:

l1_0517 <- l1_0517 %>% 
  mutate(

    # First, see how many goals the team scores relative to the average
    norm_attack = but_p %>% divide_by(mean(but_p)) %>% 
      # Then, transform it into an unconstrained scale
      log(),
    # First, see how many goals the team concedes relative to the average
    norm_defense = but_c %>% divide_by(mean(but_c)) %>% 
      # Invert it, so a higher defense is better
      raise_to_power(-1) %>% 
      # Then, transform it into an unconstrained scale
      log(),

    # Now that we have normalized attack and defense ratings, we can compute
    # measures of quality and attacking balance

    quality = norm_attack + norm_defense,
    balance = norm_attack - norm_defense
  ) %>%
arrange(desc(norm_attack))

When I look at the norm_attack column, I expect rows with the same but_p value to have the same norm_attack value, but that is not the case here:

head(l1_0517, 10)

For example, but_p is 83 in both row 5 and row 7, yet norm_attack is 0.5612738 and 0.5128357 respectively.

Is this normal? I would expect mean(l1_0517$but_p) to be a fixed value, so that equal values of l1_0517$but_p give the same result after the log normalization.

Update

I tried a simpler example, but I cannot reproduce the problem with it:

df <- tibble(a = as.integer(runif(200, 15, 100)))

df <- df %>%
  mutate(norm_a = a %>% divide_by(mean(a)) %>%
           log())

1 Answer:

Answer 0: (score: 1)

I found the solution after checking the class of l1_0517.

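It is a grouped_df, which is why the results differ: summarise() removes only the last grouping level (team), so the result is still grouped by season, and mean(but_p) inside mutate() is computed per season rather than over the whole table.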
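To illustrate the effect, here is a minimal sketch on made-up toy data (the `toy` tibble and its values are hypothetical, not the football data). It assumes a dplyr version in which summarise() drops only the last grouping level by default:

```r
library(dplyr)

# Hypothetical toy data: two seasons, two teams, two matches each.
toy <- tibble(
  season = c("A", "A", "A", "A", "B", "B", "B", "B"),
  team   = c("x", "x", "y", "y", "x", "x", "y", "y"),
  goals  = c(1, 2, 3, 4, 1, 2, 10, 20)
)

per_team <- toy %>%
  group_by(season, team) %>%
  summarise(but_p = sum(goals))

# Only the last grouping level (team) was dropped.
group_vars(per_team)  # "season"

# Still grouped by season: mean(but_p) is a per-season mean.
grouped <- per_team %>%
  mutate(ratio = but_p / mean(but_p))

# Ungrouped: mean(but_p) is the mean over all rows.
ungrouped <- per_team %>%
  ungroup() %>%
  mutate(ratio = but_p / mean(but_p))
```

Team x has but_p = 3 in both seasons. In `grouped` its ratio differs between seasons because each season has its own mean, while in `ungrouped` the two ratios are identical.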

The correct code is:

l1_0517 <- tmp3 %>%
  group_by(season, team)%>%
  summarise(j = n(),
            pts = sum(Pts),
            diff_but = (sum(BP) - sum(BC)),
            diff_t_ca = (sum(TCP, na.rm = T) - sum(TCC, na.rm = T)),
            diff_t = (sum(TP, na.rm = T) - sum(TC, na.rm = T)), 
            but_p = sum(BP),
            but_c = sum(BC),
            tir_ca_p = sum(TCP, na.rm = T),
            tir_ca_c = sum(TCC, na.rm = T),
            tir_p = sum(TP, na.rm = T),
            tir_c = sum(TC, na.rm = T)) %>%
  ungroup() %>%
  arrange((season), desc(pts), desc(diff_but))
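As a possible alternative to the explicit ungroup() call, newer dplyr releases (1.0.0 and later, which is an assumption about the installed version) let summarise() drop all grouping itself through its `.groups` argument. A minimal sketch on hypothetical toy data, not the football data:

```r
library(dplyr)

# Hypothetical toy data standing in for tmp3.
toy <- tibble(
  season = c("A", "A", "B", "B"),
  team   = c("x", "y", "x", "y"),
  BP     = c(3, 7, 3, 30)
)

dropped <- toy %>%
  group_by(season, team) %>%
  # .groups = "drop" returns an ungrouped tibble (requires dplyr >= 1.0.0).
  summarise(but_p = sum(BP), .groups = "drop")

# Quick check before any later mutate(): the result carries no groups.
is_grouped_df(dropped)  # FALSE
```

Either way, checking is_grouped_df() or group_vars() on the summarised table before normalizing is a cheap way to confirm which rows mean() will be taken over.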