How to replace all values in a column based on an ordered vector in R

Date: 2019-11-27 22:52:43

Tags: r

I am trying to replace all numeric values in a data frame column with ordered categories. Here is a dummy data frame:

df <- data.frame(a = c(1:100), b = sample(c(0:20), size = 100, replace = TRUE), c = c(1:100))

Note that the actual data frame comes from a dta file imported with haven::read_dta(). The actual data is available from the GSS. I am working with the 2018 file and want to replace all values in b (i.e. 0 to 20) with the following set of categories:

educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")
educ_fac <- factor(educ_vec, ordered = TRUE, levels = educ_vec)

If I use ifelse inside mutate for each category, the process is too long and does not preserve the order in educ_fac. I tried several ways to do this in one step, without success. One approach was the first call below. The other two were similar, such as the second call, and failed as well:

gss_df %>% 
  mutate(educ = fct_recode(educ, 
                           "No formal schooling" = 0, 
                           "1st grade" = 1, 
                           "2nd grade" = 2, 
                           "3rd grade" = 3, 
                           "4th grade" = 4, 
                           "5th grade" = 5, 
                           "6th grade" = 6, 
                           "7th grade" = 7, 
                           "8th grade" = 8, 
                           "9th grade" = 9, 
                           "10th grade" = 10, 
                           "11th grade" = 11, 
                           "12th grade" = 12, 
                           "1 year of college" = 13, 
                           "2 years of college" = 14, 
                           "3 years of college" = 15, 
                           "4 years of college" = 16, 
                           "5 years of college" = 17, 
                           "6 years of college" = 18, 
                           "7 years of college" = 19, 
                           "8 years of college" = 20))

Error: `f` must be a factor (or character vector or numeric vector).

gss_df %>%
  mutate(educ = fct_recode(educ, educ_fac))

Error: `f` must be a factor (or character vector or numeric vector).

Can anyone help me solve this?

2 answers:

Answer 0: (score: 1)

For some reason I could not read the dta file, so below I simulate data to show what I suggest. You start with the educ_vec vector.

educ_vec <- c("No formal schooling", "1st grade", 
"2nd grade", "3rd grade", "4th grade", "5th grade", 
"6th grade", "7th grade", "8th grade", "9th grade", 
"10th grade", "11th grade", "12th grade", "1 year of college", 
"2 years of college", "3 years of college", "4 years of college", 
"5 years of college", "6 years of college", "7 years of college", 
"8 years of college")

If you look at educ_vec, it is already in the order you want:

# this is meant for 0
educ_vec[1]
[1] "No formal schooling"
# this is meant for 20
educ_vec[21]
[1] "8 years of college"

R vectors are indexed from 1, so if your score is i, the new category value is educ_vec[i + 1]. We can use that directly:

library(dplyr)  # provides %>% and mutate()

set.seed(100)
gss_df <- data.frame(educ = sample(0:20, 30, replace = TRUE))
gss_df %>%
  mutate(new = factor(educ_vec[educ + 1], ordered = TRUE, levels = educ_vec))

   educ                new
1     9          9th grade
2     5          5th grade
3    15 3 years of college
4    18 6 years of college
5    13  1 year of college
6    11         11th grade
7     5          5th grade
8     3          3rd grade
9     5          5th grade
10    1          1st grade
11    6          6th grade
12    6          6th grade
13   10         10th grade
14   17 5 years of college
15   11         11th grade
16    2          2nd grade
17   18 6 years of college
18    7          7th grade
19   17 5 years of college
20    1          1st grade
21   18 6 years of college
22    3          3rd grade
23    3          3rd grade
24   19 7 years of college
25   15 3 years of college
26   20 8 years of college
27    6          6th grade
28   15 3 years of college
29   10         10th grade
30   19 7 years of college

And yes, this still works when some of the levels do not occur in the data:

gss_df <- data.frame(educ=0:5)%>%
mutate(new=factor(educ_vec[educ+1],ordered = TRUE, levels = educ_vec))

  educ                 new
1    0 No formal schooling
2    1           1st grade
3    2           2nd grade
4    3           3rd grade
5    4           4th grade
6    5           5th grade

You can see that the new column is an ordered factor with the expected levels.

str(gss_df)
'data.frame':   6 obs. of  2 variables:
 $ educ: int  0 1 2 3 4 5
 $ new : Ord.factor w/ 21 levels "No formal schooling"<..: 1 2 3 4 5 6
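As a small illustration of why the ordered factor is worth keeping, the sketch below shows that comparisons and max() follow the level order. The label vector is rebuilt with paste() only to keep the example short, and the five sample scores are made up for illustration; neither comes from the GSS data.

```r
# Rebuild the 21 labels compactly (same values as educ_vec above)
educ_vec <- c("No formal schooling",
              paste(c("1st", "2nd", "3rd", paste0(4:12, "th")), "grade"),
              "1 year of college",
              paste(2:8, "years of college"))

# Hypothetical scores, chosen only for this illustration
educ <- c(0, 5, 12, 16, 20)
new <- factor(educ_vec[educ + 1], ordered = TRUE, levels = educ_vec)

# Comparisons use the level order, not alphabetical order
at_least_hs <- new >= "12th grade"

# max() is defined for ordered factors and returns the highest level present
highest <- max(new)
```

Here at_least_hs is TRUE only for the scores 12, 16 and 20, which an unordered factor or a plain character column could not express.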

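If your scores can fall outside 0 to 20, for example -1, -2 or 21, 22, the educ + 1 indexing is no longer safe, because R silently drops index 0 and treats negative indices as exclusions. In that case I suggest matching on names instead: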

names(educ_vec) = 0:20
gss_df <- data.frame(educ=c(-1,0,20,21))
# you can also use mutate
gss_df$new <- educ_vec[match(gss_df$educ,names(educ_vec))]
gss_df

  educ                 new
1   -1                <NA>
2    0 No formal schooling
3   20  8 years of college
4   21                <NA>

match() returns NA whenever no corresponding name is found in educ_vec.
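Note that the match() result above is a plain character column. If the ordering is also needed, one possible variation (a sketch, not part of the original answer) is to wrap the lookup in factor(). The input scores here are invented, and the labels are rebuilt with paste() only for brevity.

```r
# Rebuild the 21 labels compactly (same values as educ_vec above)
educ_vec <- c("No formal schooling",
              paste(c("1st", "2nd", "3rd", paste0(4:12, "th")), "grade"),
              "1 year of college",
              paste(2:8, "years of college"))

# Hypothetical scores, including out-of-range and missing values
educ <- c(-1, 0, 20, 21, NA)

# Position of each score within 0:20; NA when the score is not a valid code
idx <- match(educ, 0:20)

# Indexing with NA yields NA, so invalid scores stay missing in the factor
new <- factor(educ_vec[idx], levels = educ_vec, ordered = TRUE)
```

Out-of-range scores become NA instead of shifting or dropping rows, and the result keeps all 21 levels in order.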

Answer 1: (score: 1)

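Another way to approach this is to use a named vector and set the factor ordering afterwards. Once the .dta file has been read into your workspace, there are several ways to handle it.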

set.seed(777)
library(tidyverse)
df <- data.frame(a = c(1:100), b = sample(c(0:20), size = 100, replace = TRUE), c = c(1:100))

# -------------------------------------------------------------------------
head(df)
#   a  b c
# 1 1  0 1
# 2 2 18 2
# 3 3 11 3
# 4 4  9 4
# 5 5 11 5
# 6 6  8 6

# -------------------------------------------------------------------------

# these labels will be used as the names instead
educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")

# values as characters from "0" to "20"
value_vec <- as.character(seq(21)-1)

# assign educ_vec as names 
names(value_vec) <- educ_vec

# fct_recode b
df$educ <- fct_recode(factor(df$b), !!!value_vec)

# set educ as ordered factor using educ_vec as levels
df$educ <- factor(df$educ, ordered = TRUE, levels = educ_vec)

# -------------------------------------------------------------------------
head(df)
#   a  b c                educ
# 1 1  0 1 No formal schooling
# 2 2 18 2  6 years of college
# 3 3 11 3          11th grade
# 4 4  9 4           9th grade
# 5 5 11 5          11th grade
# 6 6  8 6           8th grade

# -------------------------------------------------------------------------
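For comparison, the same result can probably be reached in a single base R call, because factor() accepts both the codes (levels) and their display names (labels). This is a swapped-in alternative rather than what either answer proposes; the labels are rebuilt with paste() for brevity, and the simulated column b stands in for the real data.

```r
# Rebuild the 21 labels compactly (same values as educ_vec above)
educ_vec <- c("No formal schooling",
              paste(c("1st", "2nd", "3rd", paste0(4:12, "th")), "grade"),
              "1 year of college",
              paste(2:8, "years of college"))

# Simulated scores standing in for the real column
set.seed(777)
df <- data.frame(b = sample(0:20, size = 100, replace = TRUE))

# levels = the numeric codes, labels = the names they map to, in the same order
df$educ <- factor(df$b, levels = 0:20, labels = educ_vec, ordered = TRUE)

head(df)
```

Any value of b outside 0:20 would become NA. If the column comes straight from haven::read_dta(), it may carry a labelled class, so converting it with as.numeric() first is an assumption worth checking against the real file.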