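I'm only in an introductory R course, so this may be very basic.
I'm working with the Outlook on Life dataset, and I'm interested in income. Respondents had to choose one of the following 19 options: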
Less than $5,000
$5,000 to $7,499
$7,500 to $9,999
$10,000 to $12,499
$12,500 to $14,999
$15,000 to $19,999
$20,000to $24,999
$25,000 to $29,999
$30,000 to $34,999
$35,000 to $39,999
$40,000 to $49,999
$50,000 to $59,999
$60,000 to $74,999
$75,000 to $84,999
$85,000 to $99,999
$100,000 to $124,999
$125,000 to $149,999
$150,000 to $174,999
$175,000 or more
I'd like to collapse these into a smaller set of categories so that the plot is easier to read.
How would I recode this?
Thanks!
Answer 0 (score: 2)
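The easiest way to recode a factor is to recognize that the levels replacement function (levels(x) <- value) accepts a named list, which it uses to remap the factor's levels: each list name becomes a new level, and each element holds the old levels to fold into it.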
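I'm assuming your data is already a factor (since you say "respondents had to choose one of the following 19 options"), which means the cut function doesn't really make sense here: cut bins numeric values, not labels.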
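To make the distinction concrete, here is a minimal sketch of what cut is for. The dollar amounts, break points, and labels below are invented for illustration and are not taken from the Outlook on Life data.

```r
# Hypothetical numeric incomes; the survey itself only records brackets.
income_dollars <- c(4200, 27000, 61000, 180000)

# cut() bins numeric values into intervals: [0, 25000), [25000, 35000), [35000, Inf)
binned <- cut(income_dollars,
              breaks = c(0, 25000, 35000, Inf),
              labels = c("low", "middle", "high"),
              right = FALSE)
print(binned)

# Handing cut() a factor of bracket labels fails, because it requires numeric input.
bracket <- factor(c("Less than $5,000", "$175,000 or more"))
result <- tryCatch(cut(bracket, breaks = 2), error = function(e) e)
print(inherits(result, "error"))
```

Since the survey answers are already bracket labels, remapping the levels is the more direct route.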
Here is a simple example of remapping levels:
z <- gl(3, 2, 12) # [1] 1 1 2 2 3 3 1 1 2 2 3 3, Levels: 1 2 3
levels(z) <- list(A = c(1,3), B = 2)
z # [1] A A B B A A A A B B A A, Levels: A B
As the example above shows, levels 1 and 3 were recoded into group A, and level 2 into group B. Your problem can be handled the same way:
# Simulated responses standing in for the real income variable
groups <- as.factor(sample(c("Less than $5,000",
                             "$5,000 to $7,499",
                             "$7,500 to $9,999",
                             "$10,000 to $12,499",
                             "$12,500 to $14,999",
                             "$15,000 to $19,999",
                             "$20,000to $24,999",
                             "$25,000 to $29,999",
                             "$30,000 to $34,999",
                             "$35,000 to $39,999",
                             "$40,000 to $49,999",
                             "$50,000 to $59,999",
                             "$60,000 to $74,999",
                             "$75,000 to $84,999",
                             "$85,000 to $99,999",
                             "$100,000 to $124,999",
                             "$125,000 to $149,999",
                             "$150,000 to $174,999",
                             "$175,000 or more"), size = 100, replace = TRUE))
# Each list name is a new level; its value lists the old levels folded into it
levels(groups) <- list(
  "Under poverty line" = c("Less than $5,000",
                           "$5,000 to $7,499",
                           "$7,500 to $9,999",
                           "$10,000 to $12,499",
                           "$12,500 to $14,999",
                           "$15,000 to $19,999",
                           "$20,000to $24,999"),
  "Working class" = c("$25,000 to $29,999",
                      "$30,000 to $34,999"),
  "Lower middle class" = c("$35,000 to $39,999",
                           "$40,000 to $49,999",
                           "$50,000 to $59,999"),
  "Middle class" = c("$60,000 to $74,999",
                     "$75,000 to $84,999",
                     "$85,000 to $99,999"),
  "Upper middle class" = c("$100,000 to $124,999",
                           "$125,000 to $149,999"),
  "Top 5 percent" = c("$150,000 to $174,999",
                      "$175,000 or more")
)
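It may be worth checking the mapping before plotting. The sketch below is only an illustration: the income vector is a hypothetical stand-in that uses a handful of the 19 labels, and the names income, income_group, and n_unmapped are made up for this example. It recodes a copy of the factor, cross-tabulates old labels against new groups, and counts responses that ended up as NA, which is what happens to any original level left out of the list.

```r
# Hypothetical stand-in for the survey's income column (only a few labels used).
income <- factor(c("Less than $5,000", "$25,000 to $29,999",
                   "$30,000 to $34,999", "$175,000 or more",
                   "Less than $5,000"))

# Recode a copy so the original factor stays available for comparison.
income_group <- income
levels(income_group) <- list(
  "Under poverty line" = "Less than $5,000",
  "Working class"      = c("$25,000 to $29,999", "$30,000 to $34,999"),
  "Top 5 percent"      = "$175,000 or more"
)

# Cross-tabulate old labels against new groups to confirm the mapping.
check <- table(income, income_group)
print(check)

# Original levels missing from the list turn into NA; count those responses.
n_unmapped <- sum(is.na(income_group) & !is.na(income))
print(n_unmapped)
```

The new levels come out in the order the list names were written, so something like barplot(table(income_group)) should draw the bars from lowest to highest group without extra reordering.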