I have this data:
library(tidyverse)
df <- tibble(
"racecmb" = c("White", "White", "White", "White", "White", "White",
"White", "White", "Black", "White", "Mixed",
"Black", "White", "White", "White"),
"age" = c(77,74,55,62,60,59,32,91,75,73,43,67,58,18,57),
"income" = c("10 to under $20,000", "100 to under $150,000",
"75 to under $100,000", "75 to under $100,000",
"10 to under $20,000", "20 to under $30,000",
"100 to under $150,000", "20 to under $30,000",
"100 to under $150,000", "20 to under $30,000",
"100 to under $150,000", "Less than $10,000",
"$150,000 or more", " 30 to under $40,000",
"50 to under $75,000"),
"party" = c("Independent", "Independent", "Independent", "Democrat",
"Independent", "Republican", "Independent",
"Independent", "Democrat", "Republican", "Republican",
"Democrat", "Democrat", "Independent", "Independent"),
"ideology" = c("Moderate", "Moderate", "Conservative", "Moderate",
"Moderate", "Very conservative", "Moderate",
"Conservative",
"Conservative", "Moderate", "Conservative",
"Very conservative", "Liberal", "Moderate", "Conservative")
)
I want to run (and have already tried to run) a simple multiple regression:
regression <- lm(party ~ income + ideology + age, data = df) %>%
  summary()
I get this error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
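My goal is to explain how certain people vote, but I can't see how to encode the data effectively for my model.
Any comments/suggestions are appreciated...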
Answer 0 (score: 2)
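So, first of all, using lm() with a categorical response variable is not ideal: party is stored as text, so lm() cannot treat it as a numeric y, which is what the error is telling you. What you want instead is rpart(), which gives you a category or class as output, or a multinomial logit/probit regression, which returns the probability of each outcome occurring under given conditions.
Packages to install: rpart and statisticsModeling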
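As a rough sketch of both options, the code below fits a classification tree and a multinomial logit to three columns of the question's data. The choice of nnet::multinom() for the multinomial logit is my assumption (the answer does not name a function for it), and minsplit = 5 is only there because the toy data has 15 rows.

```r
library(rpart)  # classification trees
library(nnet)   # multinom(); an assumed choice for the multinomial logit

# Three columns of the question's data, rebuilt so the example is
# self-contained. Categorical columns are stored as factors.
df <- data.frame(
  party = factor(c("Independent", "Independent", "Independent", "Democrat",
                   "Independent", "Republican", "Independent", "Independent",
                   "Democrat", "Republican", "Republican", "Democrat",
                   "Democrat", "Independent", "Independent")),
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57),
  ideology = factor(c("Moderate", "Moderate", "Conservative", "Moderate",
                      "Moderate", "Very conservative", "Moderate",
                      "Conservative", "Conservative", "Moderate",
                      "Conservative", "Very conservative", "Liberal",
                      "Moderate", "Conservative"))
)

# Classification tree: method = "class" because the response is categorical.
# minsplit is lowered only because this toy data set is tiny.
tree <- rpart(party ~ age + ideology, data = df, method = "class",
              control = rpart.control(minsplit = 5))
tree_class <- predict(tree, newdata = df, type = "class")

# Multinomial logit: returns one probability per party for every row.
mlogit <- multinom(party ~ age + ideology, data = df, trace = FALSE)
mlogit_prob <- predict(mlogit, newdata = df, type = "probs")
```

With only 15 observations neither model is trustworthy; the point is just the shape of the calls and of what they return (a predicted class per row, and a matrix of class probabilities per row).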
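If your response variable is not categorical, you can convert the categorical predictors into dummy variables and then run a regression that includes those dummies (remember to leave one of them out as the baseline).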
This can be done quickly with the fastDummies package:
Example: df <- dummy_cols(df, select_columns = "ideology")
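To make the baseline idea concrete, here is a small base-R sketch that does not rely on fastDummies. It assumes a numeric response (age is used purely for illustration) and shows that model.matrix(), which lm() uses internally for factors, creates one dummy per level except the first.

```r
# Two columns of the question's data, rebuilt so the example is self-contained.
df <- data.frame(
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57),
  ideology = factor(c("Moderate", "Moderate", "Conservative", "Moderate",
                      "Moderate", "Very conservative", "Moderate",
                      "Conservative", "Conservative", "Moderate",
                      "Conservative", "Very conservative", "Liberal",
                      "Moderate", "Conservative"))
)

# Dummy coding as lm() sees it: an intercept plus one 0/1 column for every
# level except the first ("Conservative"), which becomes the baseline.
dummies <- model.matrix(~ ideology, data = df)
colnames(dummies)

# lm() builds the same dummies automatically when the predictor is a factor;
# each coefficient is the difference from the baseline level.
fit <- lm(age ~ ideology, data = df)
coef(fit)
```

If a different baseline is wanted, relevel(df$ideology, ref = "Moderate") changes which level is left out.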
If your sample size is large, you may also want to consider interactions between the variables in your model!
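For reference, an interaction is written with * in an R formula. The sketch below uses the built-in mtcars data as a stand-in (an assumption on my part, since the 15-row example above is too small to estimate interactions); the variables have nothing to do with the question and only illustrate the syntax.

```r
# wt * factor(am) expands to wt + factor(am) + wt:factor(am), so the slope
# of wt is allowed to differ between the two transmission types.
fit_int <- lm(mpg ~ wt * factor(am), data = mtcars)
names(coef(fit_int))
```

The same formula syntax works in rpart-free model functions such as glm() and in most multinomial logit implementations, so an interaction can be added there in the same way.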