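I have two data frames, one for the men and one for the women who work at a company. One is 15,000 rows x 1,000 columns and the other is 150 x 1,000. Each column represents an attribute (e.g., salary, height). I am comparing the female and male employees within each bracket (there are five brackets in total).
Below, I create some dummy data and a for loop.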
#Create the data
num_of_employee = 100
f <- rep(c("Female"), 15)
m <- rep(c("Male"), 85)
Employee = paste("Employee",seq(1:num_of_employee))
Bracket = sample(seq(1,5,1),num_of_employee, replace = TRUE)
Height = sample(seq(65,100, 1),num_of_employee, replace = TRUE)
Weight = sample(seq(120,220, 1),num_of_employee, replace = TRUE)
Years_Employed = sample(seq(1,13, 1),num_of_employee, replace = TRUE)
Income = sample(seq(50000,200000, 1000),num_of_employee, replace = TRUE)
gender <- sample(append(f,m), replace = FALSE)
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
women <-df1[df1$gender == 'Female',]
men <- df1[df1$gender == 'Male',]
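That is all of the data. Now, this for loop essentially compares the male and female data frames column by column. So, for example, Income from df1 is compared with Income from df2, and likewise for Height, Years_Employed, and so on.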
v <-c()
runs <- 1000
for(j in 1:runs){
  male_vector <- c()
  female_vector <- c()
  #loop through each of the 5 Brackets
  for(z in 1:5){
    #count the number of women in this bracket
    number_of_rows <- length(which(women$Bracket == z))
    #draw the same number of men from this bracket and collect both groups' Height
    male_vector <- append(male_vector, men[sample(which(men$Bracket == z), number_of_rows), ]$Height)
    female_vector <- append(female_vector, women[which(women$Bracket == z), ]$Height)
  }
  #Ask, are men and women different?
  v <- append(v, sum(male_vector) > sum(female_vector))
}
#How many times are the men>women out of 1000?
as.numeric(sum(v))
[1] 70
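So this code works, but I would like to compare every column, that is, Height, Weight, Years_Employed, and Income.
I would like to pass in the two data frames and get output like the following: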
"Height " 0.223
"Salary: " 0.994
"Weight: " 0.006
"Years_Employed:" 0.325
.
.
.
"1000th column :" 0.013
Note that my actual data consists of 1,000 columns, so I cannot hard-code anything (which is how I did it originally).
Answer 0: (score: 2)
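The following is much simpler than your code.
Note that there are loops in disguise, namely split and sapply. But the code is more concise and avoids repeating the same computations.
If you call set.seed(4358) just before running your code, the result will be exactly the same as the mean(v) at the end of this.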
set.seed(4358) # Needed because of the call to sample()
runs <- 1000
v <- logical(runs)
df1_br <- split(df1, df1$Bracket)
df2_br <- split(df2, df2$Bracket)
female_vector <- sapply(df2_br, function(x) sum(x$Income))
sum_female_vector <- sum(female_vector)
number_of_rows <- sapply(df2_br, nrow)
for(j in 1:runs){
  male_vector <- sapply(seq_along(df1_br), function(i) sum(sample(df1_br[[i]]$Income, number_of_rows[i], TRUE)))
  v[j] <- sum(male_vector) > sum_female_vector
}
mean(v)
#[1] 0.933
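To make the role of split and sapply in the code above more concrete, here is a minimal sketch on a toy data frame. The object names (toy, toy_br, bracket_sums, bracket_rows) and the values are made up for illustration and do not come from the data in this thread.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Bracket = c(1, 1, 2, 2, 2),
  Income  = c(10, 20, 30, 40, 50)
)

# split() returns a named list with one data frame per Bracket value
toy_br <- split(toy, toy$Bracket)
names(toy_br)
# [1] "1" "2"

# sapply() loops over that list and returns one value per bracket
bracket_sums <- sapply(toy_br, function(x) sum(x$Income))  # 30 and 120
bracket_rows <- sapply(toy_br, nrow)                       # 2 and 3
```

Because the female sums and row counts do not change between runs, computing them once outside the for loop is what avoids the repeated work.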
Sample data.
I have recreated the dataset, calling set.seed() first.
set.seed(6736)
num_of_employee = 15000
#Create their attributes
Employee <- paste("Employee", 1:num_of_employee)
Bracket <- sample(1:5, num_of_employee, replace = TRUE)
Height <- sample(65:100, num_of_employee, replace = TRUE)
Weight <- sample(120:220, num_of_employee, replace = TRUE)
Years_Employed <- sample(1:13, num_of_employee, replace = TRUE)
Income <- sample(seq(50000, 200000, 1000), num_of_employee, replace = TRUE)
gender <- sample(c("Female", "Male"), num_of_employee, prob = c(150, 14850)/15000, replace = TRUE)
#Finally make a dataframe for all their data
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
#Split the dataframe by gender
df2 <- df1[df1$gender == 'Female', ]
df1 <- df1[df1$gender == 'Male', ]
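As a small illustration of why set.seed() is called before the data are generated, the sketch below shows that resetting the seed replays the same sample() draws. The variable names first and second are hypothetical. Note also that R 3.6.0 changed the default sampling algorithm used by sample(), so the same seed may give different draws on older R versions.

```r
# Same seed, same call: the draws are replayed exactly
set.seed(6736)
first <- sample(1:5, 10, replace = TRUE)

set.seed(6736)
second <- sample(1:5, 10, replace = TRUE)

identical(first, second)
# [1] TRUE
```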
Edit.
To make the code above accept any column, rewrite it as a function.
compareGender <- function(Female, Male, what = "Income", Runs = 1000){
  v <- logical(Runs)
  # One data frame per bracket for each gender
  Male_br <- split(Male, Male[["Bracket"]])
  Female_br <- split(Female, Female[["Bracket"]])
  # The female totals and counts are fixed, so compute them once
  female_vector <- sapply(Female_br, function(x) sum(x[[what]]))
  sum_female_vector <- sum(female_vector)
  number_of_rows <- sapply(Female_br, nrow)
  for(j in seq_len(Runs)){
    # Draw (with replacement) as many men per bracket as there are women
    male_vector <- sapply(seq_along(Male_br), function(i) sum(sample(Male_br[[i]][[what]], number_of_rows[i], TRUE)))
    v[j] <- sum(male_vector) > sum_female_vector
  }
  # The returned element is literally named "what"
  c(what = mean(v))
}
set.seed(4358) # To compare the result with the result above
compareGender(Female = df2, Male = df1)
#[1] 0.933
compareGender(Female = df2, Male = df1, what = "Height")
#[1] 0.012
compareGender(Female = df2, Male = df1, what = "Years_Employed")
#[1] 0.815
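The function above pairs the two split lists by position and passes TRUE as the third argument to sample(), so the men are drawn with replacement, whereas the loop in the question uses sample()'s default of no replacement. A minimal sketch of a variant that looks brackets up by name and makes replacement a choice might look like this. The name compareGenderByName, its replace argument, and the toy data are assumptions for illustration and are not part of the original answer.

```r
# Hypothetical variant of compareGender(); not the original answer's function
compareGenderByName <- function(Female, Male, what = "Income", Runs = 1000,
                                replace = FALSE) {
  # Split only the column of interest, keyed by Bracket
  Female_br <- split(Female[[what]], Female[["Bracket"]])
  Male_br   <- split(Male[[what]],   Male[["Bracket"]])
  # Every bracket that contains women must also contain men
  stopifnot(all(names(Female_br) %in% names(Male_br)))
  sum_female <- sum(unlist(Female_br))
  v <- logical(Runs)
  for (j in seq_len(Runs)) {
    male_sums <- vapply(names(Female_br), function(b) {
      pool <- Male_br[[b]]
      # sample.int() draws indices, so a bracket with a single man is safe
      idx <- sample.int(length(pool), length(Female_br[[b]]), replace = replace)
      sum(as.numeric(pool[idx]))
    }, numeric(1))
    v[j] <- sum(male_sums) > sum_female
  }
  mean(v)
}

# Toy data: 200 women and 1800 men spread evenly over 5 brackets
set.seed(1)
n <- 2000
toy <- data.frame(
  Bracket = rep(1:5, length.out = n),
  Height  = sample(65:100, n, replace = TRUE),
  gender  = rep(c("Female", "Male"), times = c(200, 1800))
)
toy_f <- toy[toy$gender == "Female", ]
toy_m <- toy[toy$gender == "Male", ]

p <- compareGenderByName(toy_f, toy_m, what = "Height", Runs = 200)
p  # a proportion between 0 and 1
```

With replace = FALSE, sample.int() stops with an error if a bracket holds more women than men, which is arguably preferable to silently reusing rows.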
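If you want to apply the function automatically to several columns, you can use an *apply function.
In this case, I will sapply the function over columns 2 to 5, that is, names(df1)[2:5].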
res <- sapply(names(df1)[2:5], function(x) compareGender(df2, df1, x))
names(res) <- sub("\\.what$", "", names(res))
res
#        Height         Weight Years_Employed         Income
#         0.012          0.211          0.827          0.948
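Since the real data has about 1,000 columns, the hard-coded 2:5 could be replaced by a computed vector of column names. The sketch below is one possible way to do that, under the assumption that every numeric column other than Bracket is an attribute to compare. The toy data frame and the names is_num and cols are hypothetical.

```r
# Toy stand-in with the same column layout as df1 in the answer
toy <- data.frame(
  Employee       = paste("Employee", 1:4),
  Height         = c(70, 80, 90, 75),
  Weight         = c(150, 160, 170, 180),
  Years_Employed = c(1, 5, 9, 13),
  Income         = c(50000, 60000, 70000, 80000),
  Bracket        = c(1, 2, 1, 2),
  gender         = c("Female", "Male", "Male", "Female")
)

# Keep the numeric columns, then drop the grouping column
is_num <- vapply(toy, is.numeric, logical(1))
cols <- setdiff(names(toy)[is_num], "Bracket")
cols
# [1] "Height"         "Weight"         "Years_Employed" "Income"

# With the answer's objects, cols could then stand in for names(df1)[2:5]:
# res <- sapply(cols, function(x) compareGender(df2, df1, x))
```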
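Now you can convert this output to a data.frame. There are two ways to do this. The first creates a df with a single column and the names attribute as row names. The second creates a df with two columns: the original column names and the means returned by compareGender.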
final1 <- data.frame(Mean = res)
final1
#                Mean
#Height         0.012
#Weight         0.211
#Years_Employed 0.827
#Income         0.948
final2 <- data.frame(Variable = names(res), Mean = res)
row.names(final2) <- NULL
final2
#        Variable  Mean
#1         Height 0.012
#2         Weight 0.211
#3 Years_Employed 0.827
#4         Income 0.948
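If a plain "name: value" listing like the one sketched in the question is preferred over a data.frame, something along these lines could work. The vector res is re-entered here with the values printed above so that the block runs on its own. The variable name out and the padding width of 16 are arbitrary choices.

```r
# Values copied from the res output shown earlier in this answer
res <- c(Height = 0.012, Weight = 0.211, Years_Employed = 0.827, Income = 0.948)

# Left-align each "name:" label in 16 characters, then print the value
# with three decimals
out <- sprintf("%-16s %.3f", paste0(names(res), ":"), res)
cat(out, sep = "\n")
```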