
Question

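I have two data frames, one for the men and one for the women who work at a company. One is 15000 rows x 1000 columns, the other is 150 x 1000. Each column represents an attribute (e.g., salary, height, etc.). I am comparing the female and male employees within each bracket (there are five brackets in total).

Below, I create some dummy data and a for loop.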

#Create the data
num_of_employee = 100
f <- rep(c("Female"), 15)
m <- rep(c("Male"), 85)

Employee = paste("Employee",seq(1:num_of_employee))
Bracket = sample(seq(1,5,1),num_of_employee, replace = TRUE)
Height = sample(seq(65,100, 1),num_of_employee, replace = TRUE)
Weight = sample(seq(120,220, 1),num_of_employee, replace = TRUE)
Years_Employed = sample(seq(1,13, 1),num_of_employee, replace = TRUE)
Income = sample(seq(50000,200000, 1000),num_of_employee, replace = TRUE)
gender <- sample(append(f,m), replace = FALSE)
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)

women <-df1[df1$gender == 'Female',]
men <- df1[df1$gender == 'Male',]

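That is all of the data. Now, this for loop essentially compares the male and female data frames column by column. So, for example, Income in the men data frame is compared with Income in the women data frame, and likewise for Height, Years_Employed, and so on.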

v <-c()
runs <- 1000
for(j in 1:runs){
  male_vector <- c()
  female_vector <- c()

  #loop through each of the 5 Brackets
  for(z in 1:5){

    #count the number of women in this bracket
    number_of_rows <- length(which(women$Bracket == z))

    #sample the same number of men from this bracket and collect the heights of both groups
    male_vector <- append(male_vector, men[sample(which(men$Bracket == z), number_of_rows), ]$Height)
    female_vector <- append(female_vector, women[which(women$Bracket == z), ]$Height)
  }
  #Ask, are men and women different?
  v <- append(v, sum(male_vector) > sum(female_vector))
}
#How many times are the men>women out of 1000?
as.numeric(sum(v))
[1] 70

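So this code works, but I want to compare every column, that is, Height, Weight, Years_Employed, and Income.

Edit

I would like to pass in the two data frames and get output like the following: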

"Height " 0.223
"Salary: " 0.994
"Weight: " 0.006
"Years_Employed:"  0.325
.
.
.
"1000th column :" 0.013

Note that my actual data has 1000 columns, so nothing can be hard-coded (which is how I did it originally).

Answer 1

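The following is much simpler than your code.
Note that there are still loops in disguise, namely split and sapply. But the code is more concise and avoids repeating the same computations.

If you call set.seed(4358) immediately before running the code, the result will be exactly the mean(v) shown at the end.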

set.seed(4358)    # Needed because of the call to sample()

runs <- 1000

v <- logical(runs)
df1_br <- split(df1, df1$Bracket)
df2_br <- split(df2, df2$Bracket)
female_vector <- sapply(df2_br, function(x) sum(x$Income))
sum_female_vector <- sum(female_vector)
number_of_rows <- sapply(df2_br, nrow)

for(j in 1:runs){
  male_vector <- sapply(seq_along(df1_br), function(i) sum(sample(df1_br[[i]]$Income, number_of_rows[i], TRUE)))
  v[j] <- sum(male_vector) > sum_female_vector
}

mean(v)
#[1] 0.933
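
To make the "loops in disguise" concrete, here is a minimal sketch of what split and sapply do in the code above. The toy data frame and its values are hypothetical and are not the data set from the question.

```r
# Toy data: hypothetical values, used only to illustrate split() and sapply()
toy <- data.frame(
  Bracket = c(1, 1, 2, 2, 2),
  Income  = c(10, 20, 30, 40, 50)
)

# split() returns a list holding one data frame per Bracket value
toy_br <- split(toy, toy$Bracket)
names(toy_br)
#[1] "1" "2"

# sapply() applies a function to every list element and simplifies the result
bracket_sums <- sapply(toy_br, function(x) sum(x$Income))
bracket_rows <- sapply(toy_br, nrow)

bracket_sums
#  1   2
# 30 120
```

Because the female sums and row counts never change between runs, computing them once outside the for loop is where the saving comes from.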

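Sample data.

I recreated the data set, calling set.seed() first so that it is reproducible. Note that here df1 holds the men and df2 holds the women, and that this block has to be run before the code above.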

set.seed(6736)

num_of_employee = 15000

#Create their attributes
Employee <- paste("Employee", 1:num_of_employee)
Bracket <- sample(1:5, num_of_employee, replace = TRUE)
Height <- sample(65:100, num_of_employee, replace = TRUE)
Weight <- sample(120:220, num_of_employee, replace = TRUE)
Years_Employed <- sample(1:13, num_of_employee, replace = TRUE)
Income <- sample(seq(50000, 200000, 1000), num_of_employee, replace = TRUE)
gender <- sample(c("Female", "Male"), num_of_employee, prob = c(150, 14850)/15000, replace = TRUE)

#Finally make a dataframe for all their data
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
#Split the dataframe by gender
df2 <- df1[df1$gender == 'Female', ]
df1 <- df1[df1$gender == 'Male', ]
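
As a small illustration of why set.seed() matters here, the sketch below draws the same sample twice after resetting the seed. The variable names are made up for the example. The exact numbers drawn may also depend on the R version, since the default sampling algorithm changed in R 3.6.0.

```r
set.seed(6736)
first_draw <- sample(1:5, 10, replace = TRUE)

# Resetting the seed puts the random number generator back in the same state
set.seed(6736)
second_draw <- sample(1:5, 10, replace = TRUE)

identical(first_draw, second_draw)
#[1] TRUE

# Without resetting the seed, the generator simply continues its stream
third_draw <- sample(1:5, 10, replace = TRUE)
```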

Edit.

To make the code above accept any column, rewrite it as a function.

compareGender <- function(Female, Male, what = "Income", Runs = 1000){

  v <- logical(Runs)
  Male_br <- split(Male, Male[["Bracket"]])
  Female_br <- split(Female, Female[["Bracket"]])
  female_vector <- sapply(Female_br, function(x) sum(x[[what]]))
  sum_female_vector <- sum(female_vector)
  number_of_rows <- sapply(Female_br, nrow)

  for(j in seq_len(Runs)){
    male_vector <- sapply(seq_along(Male_br), function(i) sum(sample(Male_br[[i]][[what]], number_of_rows[i], TRUE)))
    v[j] <- sum(male_vector) > sum_female_vector
  }

  c(what = mean(v))
}

set.seed(4358)    # To compare the result with the result above
compareGender(Female = df2, Male = df1)
#[1] 0.933


compareGender(Female = df2, Male = df1, what = "Height")
#[1] 0.012

compareGender(Female = df2, Male = df1, what = "Years_Employed")
#[1] 0.815
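
One detail of the function is worth spelling out: c(what = mean(v)) names the element literally "what", not after the column, which is why the names of the sapply result get cleaned up with sub() afterwards. The sketch below shows the difference using setNames as one possible alternative. The helper named_mean and the vector v are hypothetical and exist only for this illustration; this is not the answer's own code.

```r
v <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical comparison results
what <- "Height"

literal_name <- c(what = mean(v))
literal_name
#what
#0.75

column_name <- setNames(mean(v), what)
column_name
#Height
#  0.75

# Hypothetical helper that returns a mean named after the column
named_mean <- function(x, what) setNames(mean(x), what)

# USE.NAMES = FALSE keeps sapply from adding its own names on top
res_named <- sapply(c("Height", "Weight"), function(w) named_mean(v, w), USE.NAMES = FALSE)
names(res_named)
#[1] "Height" "Weight"
```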

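If you want to apply the function to several columns automatically, you can use one of the *apply functions.
In this case, I sapply the function over columns 2 to 5, that is, over names(df1)[2:5].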

res <- sapply(names(df1)[2:5], function(x) compareGender(df2, df1, x))
names(res) <- sub("\\.what$", "", names(res))

res
#Height         Weight Years_Employed         Income 
#0.012          0.211          0.827          0.948
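
With 1000 columns, the positions 2:5 would have to be replaced by something that finds the attribute columns on its own. One possible approach, sketched here under the assumption that every numeric column except Bracket is an attribute to compare, is to select the columns by type. The data frame toy is a hypothetical stand-in with the same layout as the sample data.

```r
# Hypothetical stand-in with the same column layout as the sample data
toy <- data.frame(
  Employee       = c("Employee 1", "Employee 2"),
  Height         = c(70, 80),
  Weight         = c(150, 160),
  Years_Employed = c(3, 5),
  Income         = c(60000, 90000),
  Bracket        = c(1, 2),
  gender         = c("Male", "Male")
)

# Keep the numeric columns, then drop the grouping column
is_num <- vapply(toy, is.numeric, logical(1))
cols <- setdiff(names(toy)[is_num], "Bracket")
cols
#[1] "Height"         "Weight"         "Years_Employed" "Income"
```

The resulting cols vector could then be passed to sapply in place of names(df1)[2:5].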

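Now you can convert this output to a data.frame. There are two ways to do this. The first creates a df with a single column and uses the names attribute as row names. The second creates a df with two columns, the original column names and the means returned by compareGender.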

final1 <- data.frame(Mean = res)
final1
#                Mean
#Height         0.012
#Weight         0.211
#Years_Employed 0.827
#Income         0.948


final2 <- data.frame(Variable = names(res), Mean = res)
row.names(final2) <- NULL
final2
#        Variable  Mean
#1         Height 0.012
#2         Weight 0.211
#3 Years_Employed 0.827
#4         Income 0.948
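
If the goal is text output in the "name: value" style shown in the question, a named vector like res can also be formatted with sprintf. This is only a sketch: the vector below copies the values printed above rather than recomputing them, and out_lines is a name made up for the example.

```r
# Values copied from the output shown above, not recomputed here
res <- c(Height = 0.012, Weight = 0.211, Years_Employed = 0.827, Income = 0.948)

# One "name: value" string per column, with three decimal places
out_lines <- sprintf("%s: %.3f", names(res), res)
writeLines(out_lines)
#Height: 0.012
#Weight: 0.211
#Years_Employed: 0.827
#Income: 0.948
```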
