R指定答案的概率

时间:2017-04-27 16:40:00

标签: r probability

我有100个任务和20个人来完成这些任务。我随机地为每个人分配7个任务(以便稍后计算评估者之间的协议)。所以我有一个dataID,其中包含personID,taskID。每项任务有5种可能的答案。我通过

模拟了“真实”答案(来自5个可能的答案)

truth <- sample(answers, no.tasks, replace = FALSE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))

并将其添加到我的dataSet,它现在有三列:personID,taskID,truth;

我还说50%的人表现不佳(穷人),10%的任务是艰巨的任务(hardtasks)。我假设每个概率:性能差,性能好,任务艰巨,任务简单

ppoor <- 0.7
pgood <- 0.99
phard <- 0.2
peasy <- 0.8

现在,我需要根据她在任务中的好坏来模拟这个人的答案,以及该任务是容易还是困难。对于我正在做的一个这样的组合:

for(i in 1:length(dataSet$taskID)) { ifelse(dataSet$personID[i] %in% poorperson && dataSet[dataSet$personID == i,]$taskID %in% hardtasks, probs[i] <- ppoor * phard, NA)}

但是,我没有达到我的期望,即我希望得到每个任务和工人组合的'probs'。我在这里做错了什么?

1 个答案:

答案 0 :(得分:1)

我无法理解您的代码,但我以一种希望开始讨论您需要向前发展的方式重新编写代码。如果您有任何问题,请告诉我们!

# Initialize 
no.tasks <- 100
no.workers <- 20 
tasksperworker <- no.tasks/no.workers # 5 each ( why did you have 7?)

# ANSWERS for Tasks
answers <- c("liver", "blood", "lung", "brain", "heart")
(truth <- sample(answers, no.tasks, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2)))

# TASKS
prop_hardtasks <- .10
hardtasks <- sample(1:no.tasks, prop_hardtasks * no.tasks)
easytasks <- setdiff(1:no.tasks, hardtasks)
phard <- 0.2
peasy <- 0.8
(task_Difficulty <- ifelse(1:no.tasks %in% easytasks, peasy, phard))

# WORKERS
prop_poorworkers <- .50 
poorworkers <- sample(1:no.workers, prop_poorworkers * no.workers)
goodworkers <- setdiff(1:no.workers, poorworkers)
ppoor <- 0.7
pgood <- 0.99
(worker_Ability <- ifelse(1:no.workers %in% goodworkers, pgood, ppoor))

# The dataset
# One step data creation
dataSet <- data.frame("workerID" = rep(1:no.workers, each = tasksperworker), 
                  "taskID" = 1:no.tasks, 
                  "truth" = truth, 
                  "taskDifficulty" = factor(task_Difficulty, labels = c("hard","easy")), 
                  "workerAbility" = factor(rep(worker_Ability, each = tasksperworker), labels = c("poor","good")), 
                  "probCorrect" = task_Difficulty * worker_Ability)

# I am coding out the old method as I believe it samples twice which is not necessary
# (assignmentMatrix <- replicate(no.workers, sample(1:no.tasks, tasksperworker, replace=FALSE))) 
#(assignEach <- reshape::melt.matrix(assignmentMatrix))
#(dataSet <- cbind.data.frame("workerID" = assignEach[,2], 
#                             "taskID" = assignEach[,3], 
#                             "truth" = truth[assignEach[,2]], 
#                             "taskDifficulty" = factor(task_Difficulty, labels = c("hard",'easy')), 
#                             "workerAbility" = factor(rep(worker_Ability,each = tasksperworker), labels = c("poor","good")), 
#                             "probCorrect" = task_Difficulty * worker_Ability))
根据要求

编辑

鉴于有5种可能性,每项任务都有一个事实(我们随机生成),我们希望保存工人得到正确答案的概率,同时为剩余的概率分配其他可能的答案。此向量可能会针对每个任务/工作人员进行更改。 sample(..., prob)参数必须始终等于1.因此,我继续执行以下操作:

# Initialize a matrix where each row contains the probability vector we will use to sample. 
truthProb <- matrix(NA, nrow = no.tasks, ncol = length(answers), dimnames = list(1:no.tasks, answers)) 

# run a for loop to populate it
for(i in 1:no.tasks){
  # Because Answer always changes, 
  # Find location of truth amongst answer vector using which
  # And place the probCorrect value into that spot
  truthProb[i, which(answers %in% dataSet$truth[i])] <- dataSet$probCorrect[i]
  # I would assign equal remaining probabilities to other incorrect answers.
  truthProb[i, -which(answers %in% dataSet$truth[i])] <- (1 - dataSet$probCorrect[i]) / (length(answers) - 1) 
}
rowSums(truthProb) # Should sum to 1

# Add their answers here by using apply to say, given a probability from each row (task), draw the single answer
dataSet$results <- apply(truthProb, 1, function(x) sample(answers, 1, replace = F, prob = x)) 

tail(truthProb) # see the probabilities for each task
tail(dataSet) # can compare the last bit

# how did we do? table showing results at each probCorrect (combination of difficulty and ability)
table(dataSet$truth, dataSet$results, dataSet$probCorrect) 

# double check again that this idea works as intended...
dataSet$truth[1]
truthProb[1,]
sum(1 * (dataSet$truth[1] == replicate(100, sample(answers, 1, replace = F, prob = truthProb[1,])))) /100

编辑所有功能版本:

# Simulate a Function!!
sim <- function(answers, no.tasks, no.workers, prop_hardtasks, prop_poorworkers, prob_hardeasy, prob_poorgood){
  # Initialize 
  tasksperworker <- no.tasks/no.workers
  # ANSWERS for Tasks
  truth <- sample(answers, no.tasks, replace = TRUE) # assumes equal probability of each answer
  # TASKS
  hardtasks <- sample(1:no.tasks, prop_hardtasks * no.tasks)
  easytasks <- setdiff(1:no.tasks, hardtasks)
  phard <- prob_hardeasy[1]
  peasy <- prob_hardeasy[2]
  (task_Difficulty <- ifelse(1:no.tasks %in% easytasks, peasy, phard))
  # WORKERS
  poorworkers <- sample(1:no.workers, prop_poorworkers * no.workers)
  goodworkers <- setdiff(1:no.workers, poorworkers)
  ppoor <- prob_poorgood[1]
  pgood <- prob_poorgood[2]
  (worker_Ability <- ifelse(1:no.workers %in% goodworkers, pgood, ppoor))
  # One step data creation
  dataSet <- data.frame("workerID" = rep(1:no.workers, each = tasksperworker), 
                        "taskID" = 1:no.tasks, "truth" = truth, "taskDifficulty" = factor(task_Difficulty, labels = c("hard","easy")), 
                        "workerAbility" = factor(rep(worker_Ability, each = tasksperworker), labels = c("poor","good")), 
                        "probCorrect" = task_Difficulty * rep(worker_Ability, each = tasksperworker))
  # SIMULATE ANSWER
  truthProb <- matrix(NA, nrow = no.tasks, ncol = length(answers), dimnames = list(1:no.tasks, answers)) 
  for(i in 1:no.tasks){
    truthProb[i, which(answers %in% dataSet$truth[i])] <- dataSet$probCorrect[i]
    truthProb[i, -which(answers %in% dataSet$truth[i])] <- (1 - dataSet$probCorrect[i]) / (length(answers) - 1) 
  }
  dataSet$results <- apply(truthProb, 1, function(x) sample(answers, 1, replace = F, prob = x)) 
  # Return
  return(dataSet)
}

dat <- sim(answers = LETTERS[1:5],  #  c("liver", "blood", "lung", "brain", "heart")
           no.tasks = 100, 
           prop_hardtasks = 10/100, 
           prob_hardeasy = c(.2, .8),
           no.workers = 20, 
           prop_poorworkers = 10/20, 
           prob_poorgood = c(.77, .99))
head(dat)
table(dat$truth, dat$results, dat$probCorrect)