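I have 100 categorical variables in a data frame and I want to create interaction terms for my predictive model. I wrote a loop to do this, but I end up with duplicates.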
df <- data.frame(Col1=c("A","B","C"),
Col2=c("F","G","H"),
Col3=c("X","Y","Z"))
This gives us:
  Col1 Col2 Col3
1    A    F    X
2    B    G    Y
3    C    H    Z
When I run the following code to create the interaction variables:
vars <- colnames(df)
for (i in vars) {
  for (j in vars) {
    if (i != j) {
      df[, paste0(i, j)] <- paste(df[[i]], df[[j]], sep = '*')
    }
  }
}
I end up with duplicates: for example, Col1Col2 carries the same information as Col2Col1.
> str(df)
'data.frame': 3 obs. of 9 variables:
$ Col1 : Factor w/ 3 levels "A","B","C": 1 2 3
$ Col2 : Factor w/ 3 levels "F","G","H": 1 2 3
$ Col3 : Factor w/ 3 levels "X","Y","Z": 1 2 3
$ Col1Col2: chr "A*F" "B*G" "C*H"
$ Col1Col3: chr "A*X" "B*Y" "C*Z"
$ Col2Col1: chr "F*A" "G*B" "H*C"
$ Col2Col3: chr "F*X" "G*Y" "H*Z"
$ Col3Col1: chr "X*A" "Y*B" "Z*C"
$ Col3Col2: chr "X*F" "Y*G" "Z*H"
Is there a way to remove these duplicates?
Answer 0 (score: 2)
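You don't need to create an explicit interaction column for each pair of variables. Instead, Col1 * Col2 in a model formula generates the interaction automatically. For example, if your outcome variable is y (which would be a column in the data frame) and you want the regression formula to include all two-way interactions between the other columns, you can do the following: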
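# setdiff() drops only the column named exactly "y"; grep("y", ...) would also drop any column whose name merely contains "y"
form = reformulate(apply(combn(setdiff(names(df), "y"), 2), 2, paste, collapse="*"), "y")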
form
y ~ Col1 * Col2 + Col1 * Col3 + Col2 * Col3
Your regression model would then be:
mod = lm(form, data=df)
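To check that the formula approach covers each pair only once, a minimal sketch follows. It assumes the question's data frame has been given a numeric outcome column named y (the values here are made up purely for illustration), and it inspects the expanded terms rather than fitting a model.

```r
# Toy data: the three columns from the question plus a hypothetical outcome y
df <- data.frame(Col1 = c("A", "B", "C"),
                 Col2 = c("F", "G", "H"),
                 Col3 = c("X", "Y", "Z"),
                 y    = c(1.2, 3.4, 5.6))  # made-up values, illustration only

# Every unordered pair of predictor names, joined with "*"
predictors <- setdiff(names(df), "y")
pairs <- combn(predictors, 2, paste, collapse = "*")

# Build y ~ Col1 * Col2 + Col1 * Col3 + Col2 * Col3
form <- reformulate(pairs, response = "y")
print(form)

# The expanded term labels: main effects plus each two-way interaction once
labels <- attr(terms(form), "term.labels")
print(labels)
```

Because combn() enumerates unordered pairs, Col2:Col1 never appears next to Col1:Col2. With 100 predictors this yields choose(100, 2) = 4950 interaction terms, so the model may still need regularisation or term selection.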
Answer 1 (score: 0)
A possible answer to your question: How to automatically include all 2-way interactions in a glm model in R
You can do two-way interactions simply using `.*.` and arbitrary n-way interactions writing `.^n`. `formula(g)` will tell you the expanded version of the formula in each of these cases.
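As a rough illustration of the `.^n` shorthand, the sketch below expands `y ~ .^2` and `y ~ .*.` against a small data frame and compares the resulting terms. The outcome column y and its values are assumptions added for the example; they are not part of the question's data.

```r
# Toy data with a hypothetical outcome column y (values are made up)
df <- data.frame(Col1 = c("A", "B", "C"),
                 Col2 = c("F", "G", "H"),
                 Col3 = c("X", "Y", "Z"),
                 y    = c(1.2, 3.4, 5.6))

# Passing data lets terms() expand the dot into the non-response columns
tt_pow  <- terms(y ~ .^2, data = df)
tt_star <- terms(y ~ .*., data = df)

labels_pow  <- attr(tt_pow, "term.labels")
labels_star <- attr(tt_star, "term.labels")
print(labels_pow)

# formula() on the terms object shows the expanded formula
print(formula(tt_pow))
```

Both shorthands should produce the three main effects and the three two-way interactions, each exactly once, so no manual de-duplication is needed.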
Answer 2 (score: 0)
One option is to use the combn and apply functions. A custom function is needed to paste two categorical values together, separated by * (e.g. A*F).
# data
df <- data.frame(Col1=c("A","B","C"),
Col2=c("F","G","H"),
Col3=c("X","Y","Z"))
#function to paste two values together in A*F format
multiplyit <- function(x){
paste(x, collapse = "*")
}
# Call combn using apply
df2 <- t(apply(df, 1, combn, 2, multiplyit))
#generate and set column names of df2
colnames(df2) <- combn(names(df), 2, paste, collapse = "")  # works for any number of columns, not just 3
#combine df and df2 to get the final df
df_final <- cbind(df, df2)
df_final
# Col1 Col2 Col3 Col1Col2 Col1Col3 Col2Col3
#1 A F X A*F A*X F*X
#2 B G Y B*G B*Y G*Y
#3 C H Z C*H C*Z H*Z
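If you would rather keep the shape of the original loop, one possible variant is to iterate over unordered pairs of column names instead of over every (i, j) combination. This is a sketch under the assumption that the column order within a pair does not matter; the variable names pairs and p are introduced only for this example.

```r
df <- data.frame(Col1 = c("A", "B", "C"),
                 Col2 = c("F", "G", "H"),
                 Col3 = c("X", "Y", "Z"))

# Each list element is one unordered pair of column names, e.g. c("Col1", "Col2")
pairs <- combn(names(df), 2, simplify = FALSE)

for (p in pairs) {
  # paste() converts factors to their labels, so this works for factor or character columns
  df[[paste0(p[1], p[2])]] <- paste(df[[p[1]]], df[[p[2]]], sep = "*")
}

names(df)
# "Col1" "Col2" "Col3" "Col1Col2" "Col1Col3" "Col2Col3"
```

Since the pairs are computed before the loop starts, the newly added columns are never paired with each other, and each combination appears once.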