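I am trying to use ddply to normalize data (subset by the variable 'season'), but it inserts the season variable at the front of my normalized data (column 4) and then shifts all the remaining data to the right.
I am new to the dplyr / plyr world, so any help is appreciated.
Fully reproducible example: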
library(plyr)
library(dplyr)
library(XML)
library(stringr)
# File Names, Functions, Parameters, etc.
# custom functions
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
first_season <- 2004
last_season <- 2013
num_seasons <- as.numeric(last_season - first_season + 1)
seasons <- seq(first_season, last_season, by=1)
rm(first_season, last_season)
# Passing
passing <- data.frame()
for (i in 1:num_seasons) {
  url <- paste("http://www.pro-football-reference.com/years/", seasons[i], "/passing.htm", sep = "")
  df <- readHTMLTable(url, which = 1)
  df$season <- seasons[i]
  df <- df[!names(df) %in% c("QBrec")]
  # Test the scalar seasons[i]; df$season is a whole column, which is not a valid if() condition
  if (seasons[i] >= 2008) df <- df[!names(df) %in% c("QBR")] # Removes QBR 2008+
  passing <- rbind(passing, df)
  rm(df)
  print(seasons[i])
}
names(passing) <- c("rank_pfr", "nameinfo", "team", "age", "games", "games_started",
#"qb_record",
"completions", "attempts", "comp_pct", "yards_passing",
"td_passing", "td_pct", "interceptions", "int_pct", "long_passing",
"yards_pass_att", "yards_pass_att_avg", "yards_pass_comp", "yards_pass_game", "pass_rate", "sacks", "sacks_pass", "yards_net_pass_att", "yards_net_pass_att_avg", "sacks_pct", "comebacks", "game_win_drives", "season")
passing <- passing[which(passing$rank_pfr!='Rk'), ]
passing[, 4:28] <- apply(passing[,4:28], 2, as.numeric)
passing[is.na(passing)] <- 0
# Note that season is the last column (both colname and viewing the data)
# colnames(passing)
# View(passing)
passing[, 4:28] <- plyr::ddply(passing[, 4:28], .(season), colwise(normalize))
# Note that season still *appears* to be the last column
# colnames(passing)
# But when you view the data the season values have been
# inserted under age, and everything else seems to be shifted to the right
# View(passing)
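Thanks!
Answer 0 (score: 3)
I think what you describe is the "normal" behavior of plyr when grouping by .(season). For example, you can do the same with the mtcars dataset and compare the results.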
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Note the position of carb (it is the last column).
Using the normalize function:
ddply(mtcars, .(carb), colwise(normalize))
#  carb        mpg cyl       disp         hp       drat         wt       qsec  vs am gear
#1    1 0.29746835   0 0.19743178 0.62222222 0.74657534 0.29846154 0.00000000 NaN  1  1.0
#2    1 0.20886076   1 1.00000000 1.00000000 0.21917808 0.84923077 0.51552795 NaN  0  0.0
#3    1 0.00000000   1 0.82343499 0.88888889 0.00000000 1.00000000 1.00000000 NaN  0  0.0
#4    1 0.90506329   0 0.04066346 0.02222222 0.90410959 0.22461538 0.53416149 NaN  1  1.0
#5    1 1.00000000   0 0.00000000 0.00000000 1.00000000 0.00000000 0.80124224 NaN  1  1.0
Double-checking with a base function:
ddply(mtcars, .(carb), colwise(max))
  carb  mpg cyl  disp  hp drat    wt  qsec vs am gear
1    1 33.9   6 258.0 110 4.22 3.460 20.22  1  1    4
2    2 30.4   8 400.0 175 4.93 3.845 22.90  1  1    5
3    3 17.3   8 275.8 180 3.07 4.070 18.00  0  0    3
4    4 21.0   8 472.0 264 4.22 5.424 18.90  1  1    5
5    6 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
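So in both cases, ddply arranges the resulting data.frame so that the grouping variable is the first column and all other variables are shifted to the right. Because your code assigns that result back into passing[, 4:28] by position, the season values land in column 4 (age) and every other column moves one place to the right. You can also check what happens when you group by .(carb, cyl).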
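If you want to stay with plyr, one possible workaround is to reorder the ddply result by column name before assigning it back. The following is only a minimal sketch on a made-up data frame (toy and its columns are hypothetical, not your passing data). It assumes the rows are already sorted by the grouping variable, because ddply returns the groups in sorted order.

```r
library(plyr)

normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical toy data, already ordered by season
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  yards  = c(10, 20, 30, 50),
  tds    = c(1, 3, 2, 6),
  season = c(2004, 2004, 2005, 2005)
)

# ddply puts the grouping variable first: season, yards, tds
res <- ddply(toy[, 2:4], .(season), colwise(normalize))
print(names(res))

# Select the result columns by name so they match the target column order
toy[, 2:4] <- res[, names(toy)[2:4]]
print(toy)
```

Selecting by name rather than by position means the assignment no longer depends on where ddply places the grouping column.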
I suggest you consider using dplyr, a newer package for working with data.frames. The dplyr equivalent of your code would be:
library(dplyr)
passing <- passing %>%
  group_by(season) %>%
  mutate_each(funs(normalize), -c(1:4))
Columns 1:4 are the ones you do not want to normalize.
If you run
mtcars %>% group_by(carb) %>% mutate_each(funs(normalize))
you can see that dplyr does not reorder the columns.
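Note that mutate_each() and funs() have since been deprecated in newer dplyr releases. As a rough sketch of the same idea with the newer interface (assuming dplyr 1.0.0 or later, and again using a made-up toy data frame rather than the real passing data), across() can select the numeric columns by type instead of by position:

```r
library(dplyr)

normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical toy data standing in for the passing table
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  yards  = c(10, 20, 30, 50),
  tds    = c(1, 3, 2, 6),
  season = c(2004, 2004, 2005, 2005)
)

out <- toy %>%
  group_by(season) %>%
  # across() skips the grouping column, so season itself is left untouched
  mutate(across(where(is.numeric), normalize)) %>%
  ungroup()

print(names(out))  # same column order as toy
print(out)
```

Selecting by type avoids hard-coding positions such as -c(1:4), which can silently point at the wrong columns if the table layout changes.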
Side notes:
To create the season variable, you can simply use
season <- 2004:2013
or
season <- first_season:last_season
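Also, when I ran your code, most of the columns were of class factor. You use
passing[, 4:28] <- apply(passing[,4:28], 2, as.numeric)
to convert them to numeric, but if the data contain factors before the conversion, as I saw, you should use as.numeric(as.character(...)) to convert them correctly.
Hope that helps.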
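To illustrate why the as.character() step matters, here is a small sketch with a made-up factor (the values are hypothetical). Calling as.numeric() directly on a factor returns the internal level codes rather than the numbers shown in the labels:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "25", "7"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(codes)
print(values)
```

The same pattern could be applied column by column, for example with lapply() over the columns that need converting.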