Hi, I have a file imported into R, and I want to recode one of its columns:
Number of People
1 to 3
4 to 6
7 to 10
.
.
.
.
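The "Number of People" column has more than 30 levels. What I want to do is convert them to numeric values (i.e. "1 to 3" becomes 2, "4 to 6" becomes 5).
Since I am working with a large amount of data, is there a more efficient way to recode this, or is it only possible with recode()?
Thanks!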
Answer 0 (score: 2)
Here is a dplyr-based solution with the same basic structure as Chris Ruehlemann's answer:
library(dplyr)
library(stringr)
df <- data.frame(Number_of_People = c("1 to 3",
                                      "4 to 6",
                                      "7 to 10"))
df %>%
  mutate(first_numb = as.numeric(str_extract(Number_of_People, "^\\d{1,}")),
         second_numb = as.numeric(str_extract(Number_of_People, "\\d{1,}$"))) %>%
  rowwise() %>%
  mutate(avg = mean(c(first_numb, second_numb)))
# A tibble: 3 x 4
  Number_of_People first_numb second_numb   avg
  <fct>                 <dbl>       <dbl> <dbl>
1 1 to 3                    1           3   2
2 4 to 6                    4           6   5
3 7 to 10                   7          10   8.5
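Since the question mentions a large amount of data, it may be worth noting that rowwise() evaluates mean() once per row. A minimal sketch of a vectorized variant follows, assuming every value has the form "<number> to <number>"; the column names are the ones used in the answer above, and `res` is a name introduced here only for illustration:

```r
library(dplyr)
library(stringr)

df <- data.frame(Number_of_People = c("1 to 3", "4 to 6", "7 to 10"))

# Vectorized midpoint: + and / operate element-wise on whole columns,
# so rowwise() is not needed.
res <- df %>%
  mutate(first_numb  = as.numeric(str_extract(Number_of_People, "^\\d+")),
         second_numb = as.numeric(str_extract(Number_of_People, "\\d+$")),
         avg         = (first_numb + second_numb) / 2)

res$avg
```

For two endpoints the midpoint (a + b) / 2 equals mean(c(a, b)), so the result should match the rowwise() version.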
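Answer 1 (score: 1)
Sample data:
df <- data.frame(
  Number_of_ppl = c("1 to 3", "40 to 45")
)
To get what you want, first extract all the numbers, convert them to numeric type, and then compute the mean:
library(stringr)
sapply(lapply(str_extract_all(df$Number_of_ppl, "\\d+"), as.numeric), mean)
This gives you:
[1] 2.0 42.5
If you want the mean as a new column in the data frame, store the result as a new variable.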
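A minimal sketch of storing the means as a new column. The column name `avg` is an assumption made for illustration and is not taken from the answer:

```r
library(stringr)

df <- data.frame(Number_of_ppl = c("1 to 3", "40 to 45"))

# Extract every run of digits, convert each set to numeric, and average it.
# The column name `avg` is hypothetical; any valid name works.
df$avg <- sapply(lapply(str_extract_all(df$Number_of_ppl, "\\d+"), as.numeric), mean)

df
```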
Answer 2 (score: 0)
We can also use separate to split the column into two parts and then take the mean of the two columns:
library(dplyr)
library(tidyr)

# Sample data (defined before use so the example runs top to bottom)
df <- data.frame(Number_of_People = c("1 to 3",
                                      "4 to 6",
                                      "7 to 10"))

df %>%
  separate(Number_of_People, into = c("first", "second"), sep = "\\s*to\\s*",
           convert = TRUE, remove = FALSE) %>%
  mutate(avg = (first + second) / 2)
#   Number_of_People first second avg
# 1           1 to 3     1      3 2.0
# 2           4 to 6     4      6 5.0
# 3          7 to 10     7     10 8.5
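Because the column in the question has only about 30 distinct values, one option for large data might be to parse each distinct value once and then look the results up, rather than parsing every row. The following is a sketch in base R, assuming every value looks like "a to b"; the names `ranges`, `midpoints` and `avg` are hypothetical:

```r
df <- data.frame(Number_of_People = c("1 to 3", "4 to 6", "7 to 10", "1 to 3"))

# Parse each distinct range only once.
ranges <- unique(as.character(df$Number_of_People))
midpoints <- vapply(strsplit(ranges, "\\s*to\\s*"),
                    function(x) mean(as.numeric(x)),
                    numeric(1))
names(midpoints) <- ranges

# Look up the precomputed midpoint for every row by name.
df$avg <- unname(midpoints[as.character(df$Number_of_People)])

df
```

The parsing cost then depends on the number of distinct levels instead of the number of rows; whether this is faster in practice depends on the data.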