Multiple linear regression on election/census data and the resulting error

Asked: 2018-07-10 20:52:46

Tags: r dplyr regression

I have this data:

library(tidyverse)

df <- tibble(
  "racecmb" = c("White", "White", "White", "White", "White", "White", 
            "White", "White", "Black", "White", "Mixed", 
            "Black", "White", "White", "White"),
  "age" = c(77,74,55,62,60,59,32,91,75,73,43,67,58,18,57),
  "income" = c("10 to under $20,000", "100 to under $150,000", 
           "75 to under $100,000",  "75 to under $100,000",
           "10 to under $20,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "Less than $10,000",
           "$150,000 or more", " 30 to under $40,000",
           "50 to under $75,000"),
  "party" = c("Independent", "Independent", "Independent", "Democrat", 
          "Independent", "Republican", "Independent", 
          "Independent", "Democrat", "Republican", "Republican", 
          "Democrat", "Democrat", "Independent", "Independent"),
 "ideology" = c("Moderate", "Moderate", "Conservative", "Moderate", 
             "Moderate", "Very conservative", "Moderate", 
             "Conservative", 
             "Conservative", "Moderate", "Conservative", 
             "Very conservative", "Liberal", "Moderate", "Conservative")
             )

I want to run (and have already tried to run) a simple multiple regression:

regression <- lm(party ~ income + ideology + age, data = df) %>%
   summary()

I get this error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
NA/NaN/Inf in 'y'

My goal is to explain how people vote, but I can't see how to encode the data effectively for my model.

Any comments/suggestions are appreciated...

1 Answer:

Answer 0: (score: 2)

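So, first of all, using lm() with a categorical response variable is not ideal: lm() expects a numeric response, and the character column party is coerced to NA, which is what produces the NA/NaN/Inf in 'y' error. What you want to use is rpart(), which gives you a category or class as output, or a multinomial logit/probit regression, which returns the probability of each outcome occurring under given conditions.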

Packages to install: rpart and statisticalModeling
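As a rough sketch of both suggestions, the snippet below fits a classification tree and a multinomial logit on a reduced copy of the question's data (party, ideology and age only). Two things here are my assumptions rather than part of the answer: nnet::multinom() is just one way to fit a multinomial logit, and minsplit is lowered only because 15 rows is far too few for the rpart defaults. With a sample this small the fitted models are illustrative at best.

```r
library(rpart)  # classification trees
library(nnet)   # multinom(); an assumed choice, the answer names no function

# Reduced copy of the question's data (party, ideology, age only).
df <- data.frame(
  party = c("Independent", "Independent", "Independent", "Democrat",
            "Independent", "Republican", "Independent",
            "Independent", "Democrat", "Republican", "Republican",
            "Democrat", "Democrat", "Independent", "Independent"),
  ideology = c("Moderate", "Moderate", "Conservative", "Moderate",
               "Moderate", "Very conservative", "Moderate",
               "Conservative", "Conservative", "Moderate", "Conservative",
               "Very conservative", "Liberal", "Moderate", "Conservative"),
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57),
  stringsAsFactors = FALSE
)

# Classification models want the response (and categorical predictors)
# stored as factors rather than character vectors.
df$party <- factor(df$party)
df$ideology <- factor(df$ideology)

# Classification tree: predicts one party label per row.
# minsplit = 5 is an assumption made only because the sample is tiny.
tree_fit <- rpart(party ~ ideology + age, data = df, method = "class",
                  control = rpart.control(minsplit = 5))
tree_pred <- predict(tree_fit, newdata = df, type = "class")

# Multinomial logit: returns one probability per party for each row.
mn_fit <- multinom(party ~ ideology + age, data = df, trace = FALSE)
mn_prob <- predict(mn_fit, newdata = df, type = "probs")
```

tree_pred holds a predicted class per respondent, while each row of mn_prob holds the estimated probabilities of the three parties, so the two outputs match the "class" versus "probability" distinction made above.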

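If the response variable is not categorical, you can convert the categorical predictors into dummy variables and then run a regression that includes the dummies (remember to leave one of them out as the baseline).

This can be done quickly with the fastDummies package: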

Example:

df <- dummy_cols(df, select_columns = "ideology")
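To make the "leave one out as the baseline" point concrete, here is a minimal sketch that uses base R's model.matrix() instead of fastDummies, so it runs without extra packages. The six-row ideology column is a shortened stand-in for the question's data. If you stay with fastDummies, dummy_cols() has a remove_first_dummy argument that serves the same purpose.

```r
# Shortened stand-in for the question's ideology column.
df <- data.frame(
  ideology = c("Moderate", "Conservative", "Very conservative", "Liberal",
               "Moderate", "Conservative"),
  stringsAsFactors = FALSE
)

# model.matrix() expands the variable into 0/1 columns and omits the first
# level ("Conservative"), which becomes the baseline. The [, -1] drops the
# intercept column so that only the dummies remain.
dummies <- model.matrix(~ ideology, data = df)[, -1, drop = FALSE]
df_dummies <- cbind(df, dummies)

colnames(dummies)
```

Note that lm() and glm() perform this expansion themselves when a predictor is a factor, so building the dummy columns by hand is mostly useful when you want to inspect or choose the baseline explicitly.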

If your sample size is large, you may also want to consider interactions between the model's variables!
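For reference, this is roughly what an interaction looks like in R's formula syntax. The toy data frame and its column names (y, age, group) are hypothetical and exist only to show the syntax; they are not taken from the question.

```r
# Hypothetical toy data, used only to illustrate the formula syntax.
toy <- data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 8.3),
  age = c(25, 32, 41, 47, 53, 60, 68, 74),
  group = factor(c("a", "b", "a", "b", "a", "b", "a", "b"))
)

# age * group is shorthand for age + group + age:group, i.e. both main
# effects plus their interaction.
fit <- lm(y ~ age * group, data = toy)

names(coef(fit))
```

Each interaction adds coefficients (one per non-baseline level for a factor), which is why a large sample is needed before they can be estimated reliably.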