Question

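I want to estimate a discrete choice model. I have a dataset containing people, their current choice at t_1, their choice at t_2, and all possible choices. Because the universe of possible choices is too large, I need to sample so that each person's choice set contains 30 alternatives. The sampling has to be without replacement, and no person may have duplicate alternatives in their choice set. Both the actual choice at t_2 and the choice at t_1 must be part of the choice set. For now I am trying this with fictional data.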

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,"current_choice":=sample(choices,1),by="ID"] #what the person uses now
people[,"chosen":=sample(choices,1),by="ID"] #what the person actually picked at t_2



#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,"choice_id":=seq_len(.N), by="ID"]

#The current choice at t_1 needs to be in the choice set
people[choice_id==1,"choice_set":=current_choice,by="ID"]

#The actual choice needs to be in the choice set
people[choice_id==2&current_choice!=chosen,"choice_set":= chosen,by="ID"]

#I want the remaining choices to be sampled from the vector of choices, but here is where I'm stuck
people[is.na(choice_set),"choice_set":=sample(choices,1),by="ID"]

The last line does not prevent a person from getting repeated alternatives, including a repeat of the current choice.

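I considered using expand.grid to create every combination of current choice and potential choice, assigning each a random uniform number, assigning a larger number to the rows holding the current or the actual choice, sorting, and then keeping the first 30 rows. The problem is that my real data, with 10000 people and 50000 choices, runs out of memory.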
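For scale, here is a rough back-of-the-envelope sketch of why the full cross join is too large. It assumes two integer columns (4 bytes each in R) and one double column for the random number (8 bytes); the variable names are illustrative only, and the real footprint would be higher once sorting makes copies.

```r
# Rough size of the full person x choice grid (illustrative names).
n_people  <- 10000
n_choices <- 50000

n_rows <- n_people * n_choices      # rows in the full cross join
bytes  <- n_rows * (4 + 4 + 8)      # two integer columns + one double column
gib    <- bytes / 1024^3            # size in GiB, before any sorting copies

print(n_rows)
print(gib)
```

That is 5e8 rows and roughly 7.5 GiB for the bare table, whereas the target output is only 10000 x 30 = 300000 rows.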

How should I approach this?

Edit: After Matt's first answer, I still ran into duplicate alternatives in the choice set. I have been trying to resolve them with the following:

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,current_choice:=sample(choices,1),by= .(ID)] #what the person uses now
people[,chosen:= sample(choices,1),by= .(ID)] #what the person actually picked at t_2

#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,choice_id:=seq_len(.N), by=.(ID)]

#The chosen alternative has to be in the choice set
people[choice_id==1L,choice_set:=chosen,by=.(ID) ]
people

#The current chosen alternative has to be in the choice set
people[current_choice!=chosen&choice_id==2L,choice_set:=current_choice,by=.(ID) ]
people

people[is.na(choice_set), choice_set := sample(setdiff(choices,unique(choice_set)), .N), by = .(ID)]

The problem then is that I introduce a missing value for individuals who chose at t_2 the same alternative they already had at t_1.

Answer 1

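As I understand the problem, here is how I would solve it, using 99% of the code you already provided (with a few cosmetic syntax tweaks here and there, mainly removing the unneeded quotes around column assignments and using the convenient .(...) syntax in the by statement of data.table, which also gets rid of those quotes).

I think the main thing that will help you is the setdiff() function in base R (see its help file by running ?base::setdiff), which ensures that once the first two rows are filled, the values of current_choice and chosen are excluded from the sampling used to fill the remaining rows.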
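A minimal sketch of the setdiff() idea, not necessarily the code originally posted. It assumes the fictional data from the question; build_choice_set and set_size are hypothetical names introduced here for illustration. Instead of expanding to 30 rows and then filling gaps, it builds each person's set in one step, so the case chosen == current_choice needs no special handling.

```r
library(data.table)

set.seed(1)                      # only to make the illustration reproducible
choices  <- 1:100                # vector of possible choices
set_size <- 30L                  # alternatives per person (assumed from the question)

people <- data.table(ID = 1:10)
people[, current_choice := sample(choices, .N, replace = TRUE)]  # choice at t_1
people[, chosen := sample(choices, .N, replace = TRUE)]          # choice at t_2

# Hypothetical helper: mandatory alternatives first, then a draw without
# replacement from every alternative that is not already in the set.
build_choice_set <- function(chosen, current, choices, size) {
  fixed <- unique(c(chosen, current))   # length 1 when chosen == current
  pool  <- setdiff(choices, fixed)      # excludes the mandatory alternatives
  # Index with sample.int() to avoid sample()'s special case for length-1 input.
  c(fixed, pool[sample.int(length(pool), size - length(fixed))])
}

# One row per person and alternative; the grouping columns are scalars in j.
long <- people[, .(choice_set = build_choice_set(chosen, current_choice,
                                                 choices, set_size)),
               by = .(ID, current_choice, chosen)]
long[, choice_id := seq_len(.N), by = .(ID)]
```

Because the mandatory values come from the grouping columns rather than from a partially filled choice_set column, the exclusion does not depend on which rows were filtered. The result has only people x 30 rows, so the same pattern should stay small with 10000 people and 50000 choices, although that has not been timed here.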


How can I create sampled choice sets for a discrete choice model using data.table?
