ddply is reordering my data

Time: 2014-07-20 13:24:25

Tags: r normalization plyr

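I am trying to use ddply to normalize my data (subset by the variable 'season'), but it inserts the season variable at the front of my normalized data (column 4) and then shifts all the other data to the right.

I'm new to the dplyr / plyr world, so any help is appreciated.

Fully reproducible example: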

library(plyr)
library(dplyr)
library(XML)
library(stringr)

# File Names, Functions, Parameters, etc. 
# custom functions
normalize <- function(x) { 
  return((x - min(x)) / (max(x) - min(x)))
}

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

first_season <- 2004
last_season <- 2013
num_seasons <- as.numeric(last_season - first_season + 1)

seasons <- seq(first_season, last_season, by=1)
rm(first_season, last_season)

# Passing 
passing <- data.frame()
for (i in 1:num_seasons) {
  url <- paste("http://www.pro-football-reference.com/years/", seasons[i],"/passing.htm", sep = "")
  df <- readHTMLTable(url,which=1)
  df$season = seasons[i]
  df <- df[!names(df) %in% c("QBrec") ] 
  if (seasons[i] >= 2008) df <- df[!names(df) %in% c("QBR") ] # Removes QBR 2008+ (scalar test; df$season is a whole column)
  passing <- rbind(passing, df)
  rm(df)
  print(seasons[i])
}

names(passing) <- c("rank_pfr", "nameinfo", "team", "age", "games", "games_started",
                    #"qb_record", 
                    "completions", "attempts", "comp_pct", "yards_passing",
                    "td_passing", "td_pct", "interceptions", "int_pct", "long_passing",
                    "yards_pass_att", "yards_pass_att_avg", "yards_pass_comp", "yards_pass_game", "pass_rate", "sacks", "sacks_pass", "yards_net_pass_att", "yards_net_pass_att_avg", "sacks_pct", "comebacks", "game_win_drives", "season")

passing <- passing[which(passing$rank_pfr!='Rk'), ]

passing[, 4:28] <- apply(passing[,4:28], 2, as.numeric) 

passing[is.na(passing)] <- 0

# Note that season is the last column (both colname and viewing the data)
# colnames(passing)
# View(passing)

passing[, 4:28] <- plyr::ddply(passing[, 4:28], .(season), colwise(normalize))

# Note that season still *appears* to be the last column
# colnames(passing)

# But when you view the data the season values have been
# inserted under age, and everything else seems to be shifted to the right
# View(passing)

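Thanks!

1 answer:

Answer 0 (score: 3)

I think what you are describing is "normal" plyr behavior that results from grouping by .(season). For example, you can do the same thing with the mtcars data set and compare the results.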

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Note the position of carb.

Using your normalize function:

ddply(mtcars, .(carb), colwise(normalize))
#   carb        mpg cyl       disp         hp       drat         wt       qsec  vs  am gear
#1     1 0.29746835   0 0.19743178 0.62222222 0.74657534 0.29846154 0.00000000 NaN   1  1.0
#2     1 0.20886076   1 1.00000000 1.00000000 0.21917808 0.84923077 0.51552795 NaN   0  0.0
#3     1 0.00000000   1 0.82343499 0.88888889 0.00000000 1.00000000 1.00000000 NaN   0  0.0
#4     1 0.90506329   0 0.04066346 0.02222222 0.90410959 0.22461538 0.53416149 NaN   1  1.0
#5     1 1.00000000   0 0.00000000 0.00000000 1.00000000 0.00000000 0.80124224 NaN   1  1.0

Double-checking with a base function:

ddply(mtcars, .(carb), colwise(max))
  carb  mpg cyl  disp  hp drat    wt  qsec vs am gear
1    1 33.9   6 258.0 110 4.22 3.460 20.22  1  1    4
2    2 30.4   8 400.0 175 4.93 3.845 22.90  1  1    5
3    3 17.3   8 275.8 180 3.07 4.070 18.00  0  0    3
4    4 21.0   8 472.0 264 4.22 5.424 18.90  1  1    5
5    6 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5

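So in both cases, ddply arranges the resulting data.frame so that the grouping variable is the first column and all the other variables are shifted to the right. You can also check what happens when you group by .(carb, cyl).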
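As a rough illustration of the point above, the sketch below runs the same ddply call on mtcars and then restores the original column order by name. The normalize helper is copied from the question; `res` and `restored` are names made up for this example. The rows also come back ordered by group, so assigning the result back into the original data frame by position is unlikely to line up.

```r
library(plyr)

# Min-max normalization, as defined in the question
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

res <- ddply(mtcars, .(carb), colwise(normalize))

# The grouping variable is now the first column
names(res)[1]

# Reorder the columns by name to match the original layout
restored <- res[, names(mtcars)]
names(restored)
```

Selecting columns by name rather than by position keeps this step independent of where ddply places the grouping variable.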

I would suggest considering dplyr, a newer package for working with data.frames. The dplyr equivalent of your code would be:

library(dplyr)
passing <- passing %>%
          group_by(season) %>%
          mutate_each(funs(normalize), -c(1:4))

Columns 1:4 are the columns you don't want to normalize.

If you run


mtcars %>% group_by(carb) %>% mutate_each(funs(normalize))

you can see that dplyr does not reorder the columns.
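mutate_each() and funs() have since been deprecated in dplyr. A possible equivalent, assuming dplyr 1.0.0 or later, uses across(). This is only a sketch on mtcars rather than on the passing data, and `out` is an illustrative name.

```r
library(dplyr)

# Min-max normalization, as defined in the question
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

out <- mtcars %>%
  group_by(carb) %>%
  # across() does not select the grouping column, so carb is left untouched
  mutate(across(everything(), normalize)) %>%
  ungroup()

# Column order is the same as in the input
names(out)
```

For the passing data, a narrower column selection in place of everything() would presumably be needed to skip the non-numeric columns.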

Side note:

To create the season variable, you can simply use

season <- 2004:2013

or

season <- first_season:last_season

Also, when I run your code, most of the columns are of class factor. You use

passing[, 4:28] <- apply(passing[,4:28], 2, as.numeric)  

to convert them to numeric, but if the data contain factors before the conversion, as I saw, you should use as.numeric(as.character(...)) to convert them correctly.
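A small sketch of why the extra as.character() step matters when as.numeric() is called on a factor directly; the factor `f` is a made-up example, not taken from the passing data.

```r
f <- factor(c("33", "2", "10"))

# Direct conversion returns the internal level codes, not the labels
as.numeric(f)

# Going through character first recovers the numbers the labels represent
as.numeric(as.character(f))
```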

Hope that helps.