Question

Hi, I have a file imported into R, and I want to recode one of its columns:

Number of People
1 to 3
4 to 6 
7 to 10
.
.
.
.

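The "Number of People" column has more than 30 levels in total. What I want to do is convert them to numeric values, using the midpoint of each range (i.e. "1 to 3" becomes 2, "4 to 6" becomes 5).

Since I am working with a large amount of data, is there a more efficient way to recode this, or is it only possible with recode()?

Thanks!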

Answer 1

Here is a dplyr-based solution with the same basic structure as Chris Ruehlemann's answer:

library(dplyr)
library(stringr)

df <- data.frame(Number_of_People = c("1 to 3",
                                       "4 to 6",
                                       "7 to 10"))

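df %>%
  # extract the leading and the trailing number of each range
  mutate(first_numb = as.numeric(str_extract(Number_of_People, "^\\d{1,}")),
         second_numb = as.numeric(str_extract(Number_of_People, "\\d{1,}$"))) %>%
  # rowwise() makes mean() operate on one row at a time
  rowwise() %>%
  mutate(avg = mean(c(first_numb, second_numb)))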
# A tibble: 3 x 4
  Number_of_People first_numb second_numb   avg
  <fct>                 <dbl>       <dbl> <dbl>
1 1 to 3                    1           3   2  
2 4 to 6                    4           6   5  
3 7 to 10                   7          10   8.5
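
Because mean() is applied one row at a time here, a possible variation for large data is to compute the midpoint with plain vectorized arithmetic and skip rowwise() altogether. The following is only a sketch under that assumption, not part of the original answer; `res` is an illustrative name.

```r
library(dplyr)
library(stringr)

df <- data.frame(Number_of_People = c("1 to 3",
                                       "4 to 6",
                                       "7 to 10"))

res <- df %>%
  # extract the lower and upper bound of each range as numbers
  mutate(first_numb = as.numeric(str_extract(Number_of_People, "^\\d+")),
         second_numb = as.numeric(str_extract(Number_of_People, "\\d+$")),
         # vectorized midpoint, computed for the whole column at once
         avg = (first_numb + second_numb) / 2)
```

For a range with exactly two bounds, the midpoint and the mean of the two bounds are the same value, so `res$avg` should match the `avg` column shown above.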

Answer 2

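Sample data:

df <- data.frame(
  Number_of_ppl = c("1 to 3", "40 to 45")
)

To get what you want, first extract all the numbers, convert them to numeric type, and then compute the mean:

library(stringr)
sapply(lapply(str_extract_all(df$Number_of_ppl, "\\d+"), as.numeric), mean)

This gives you:

[1]  2.0 42.5

If you want the means as a new column in the data frame, store the result as a new variable.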
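
Storing the means as a new column of the data frame might look like the following. This is a minimal sketch: the column name `avg` is an assumption made for illustration and is not taken from the original answer.

```r
library(stringr)

df <- data.frame(
  Number_of_ppl = c("1 to 3", "40 to 45")
)

# extract every number in each string, convert to numeric, and average them;
# `avg` is a hypothetical column name chosen for this sketch
df$avg <- sapply(lapply(str_extract_all(df$Number_of_ppl, "\\d+"), as.numeric), mean)
```

The original `Number_of_ppl` column is left untouched, so the recoded values can be checked against the source labels.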

Answer 3

We can also use separate to split the column into two, and then take the mean of the two resulting columns:

library(dplyr)
library(tidyr)
df %>%
  # split "a to b" into two columns; convert = TRUE parses them as integers
  separate(Number_of_People, into = c("first", "second"), sep = "\\s*to\\s*",
           convert = TRUE, remove = FALSE) %>%
  mutate(avg = (first + second) / 2)
#  Number_of_People first second avg
#1           1 to 3     1      3 2.0
#2           4 to 6     4      6 5.0
#3          7 to 10     7     10 8.5
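
Since the question describes a column with only about 30 distinct levels but many rows, one further option is to compute each midpoint once per level and then map it back onto the rows. The sketch below uses base R only and is not from the original answer; `f`, `lv`, and `mid` are illustrative names, and a repeated row is added to the sample data to show the mapping.

```r
df <- data.frame(Number_of_People = c("1 to 3", "4 to 6", "7 to 10", "1 to 3"))

# treat the column as a factor so each distinct range is parsed only once
f <- factor(df$Number_of_People)
lv <- levels(f)

# split every level on "to" and average its two bounds
mid <- vapply(strsplit(lv, "\\s*to\\s*"),
              function(b) mean(as.numeric(b)),
              numeric(1))

# look up the midpoint of each row through its integer level code
df$avg <- mid[as.integer(f)]
```

With this approach the string parsing cost depends on the number of levels rather than the number of rows, which may matter when the data is large.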

Data

df <- data.frame(Number_of_People = c("1 to 3",
                                       "4 to 6",
                                       "7 to 10"))
