Question

Recoding variables in R seems to be my biggest headache. What functions, packages, and processes do you use to ensure the best result?

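I have found very few useful examples on the Internet that offer a one-size-fits-all solution to recoding, and I'm interested to see what you guys and gals are using.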

Note: This may be a community wiki topic.

Answer 1

Recoding can mean a lot of things, and is fundamentally complicated.

Changing the levels of a factor can be done with the levels function:

> #change the levels of a factor
> levels(veteran$celltype) <- c("s","sc","a","l")
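
As a self-contained illustration (the factor `size` below is made up for this sketch and is not part of the veteran data), assigning to levels() renames levels by position, and giving two levels the same name merges them:

```r
# Hypothetical example factor; its levels are sorted alphabetically: "l", "m", "s"
size <- factor(c("s", "m", "l", "m", "s"))
levels(size)

# Rename by position: "l" -> "large", "m" -> "medium", "s" -> "small"
levels(size) <- c("large", "medium", "small")

# Assigning the same name to two levels merges them into one
merged <- size
levels(merged) <- c("big", "big", "small")
table(merged)
```

Because the replacement is positional, it is worth printing levels() first to check the order before assigning.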

Transforming a continuous variable simply involves applying a vectorized function:

> mtcars$mpg.log <- log(mtcars$mpg)

For binning continuous data, look at cut and cut2 (in the Hmisc package). For example:

> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
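
When the cut points are known in advance, cut also accepts explicit breaks and labels. A minimal base R sketch, where the break values 15 and 25 and the label names are purely illustrative assumptions:

```r
# Bin mpg at explicit cut points; break values and labels are illustrative only
mpg_band <- cut(mtcars$mpg,
                breaks = c(-Inf, 15, 25, Inf),
                labels = c("low", "mid", "high"),
                right = TRUE)  # intervals are (lower, upper]
table(mpg_band)
```

Using -Inf and Inf as the outer breaks means no value can fall outside the bins and become NA.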

For recoding continuous or factor variables into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:

> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")

If you are looking for a GUI, Deducer implements recoding with the Transform and Recode dialogs:

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables

Answer 2

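I find mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car:::recode.

The following example "recodes" the letters r, o, m, a and n to uppercase: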

> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
 [1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"
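
For comparison, here is a small sketch of revalue, which takes a single named vector instead of separate from and to arguments. The `answers` factor is a made-up example:

```r
library(plyr)

# Hypothetical survey answers stored as a factor
answers <- factor(c("y", "n", "n", "y", "dk"))

# revalue() takes a named vector: names are old values, elements are new values
recoded <- revalue(answers, c(y = "yes", n = "no", dk = "don't know"))
recoded
```

revalue works on factors and character vectors; for numeric vectors mapvalues is the one to use.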

Answer 3

I find this very convenient when several values need to be transformed (it is like doing recodes in Stata):

# load package and gen some data
require(car)
x <- 1:10

# do the recoding
x
## [1]   1   2   3   4   5   6   7   8   9  10

recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99   5   6   7   8   2   1
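
The recode specification also understands ranges with lo and hi and a catch-all else. A brief sketch under the assumption that three bands are wanted; the band names are illustrative:

```r
library(car)

x <- 1:10

# lo stands for the smallest value in x; else catches everything not matched before
bands <- car::recode(x, "lo:4='low'; 5:7='mid'; else='high'")
bands
```

Calling it as car::recode avoids picking up dplyr::recode by accident when both packages are attached.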

Answer 4

I find that it can sometimes be easier to convert non-numeric factors to character before attempting to change them.

df <- data.frame(example=letters[1:26]) 
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"

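Additionally, when importing data, it can be useful to make sure that numbers are actually numeric before attempting to convert them: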

df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
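
The step-by-step assignments above work because the new codes (1, 2, 3) are assigned in ascending order of the thresholds; with other codes, an earlier assignment could be caught by a later condition. One possible alternative that computes all groups in a single pass is base R's findInterval, sketched here with the same thresholds:

```r
x <- 1:100

# findInterval() counts how many break points are <= each value:
# 0 below 20, 1 from 20 up to (but not including) 80, 2 from 80 upward
group <- findInterval(x, c(20, 80)) + 1
table(group)
```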

Answer 5

If you want to recode the levels of a factor, forcats might come in handy. You can read a chapter of R for Data Science for an extensive tutorial, but here is the gist of it.

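You can even let R decide which categories (factor levels) to merge together.

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.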

# Manual recoding with fct_recode(): each argument is "new level" = "old level"
library(tidyverse)
library(forcats)
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                           "Republican, strong"    = "Strong republican",
                           "Republican, weak"      = "Not str republican",
                           "Independent, near rep" = "Ind,near rep",
                           "Independent, near dem" = "Ind,near dem",
                           "Democrat, weak"        = "Not str democrat",
                           "Democrat, strong"      = "Strong democrat",
                           "Other"                 = "No answer",
                           "Other"                 = "Don't know",
                           "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1                 Other   548
#> 2    Republican, strong  2314
#> 3      Republican, weak  3032
#> 4 Independent, near rep  1791
#> 5           Independent  4119
#> 6 Independent, near dem  2499
#> # ... with 2 more rows
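
To illustrate the lumping behaviour quoted above, here is a minimal sketch on a made-up factor. The vector `f` is hypothetical, and `n = 2` is an assumption chosen for the example: it asks fct_lump() to keep the two most frequent levels and pool the rest into "Other".

```r
library(forcats)

# Hypothetical factor: "a" occurs 4 times, "b" 3 times, "c" and "d" once each
f <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; everything else is pooled into "Other"
lumped <- fct_lump(f, n = 2)
table(lumped)
```

Recent forcats releases also provide fct_lump_n() for this case; fct_lump() is used here only to match the text above.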

Answer 6

Consider this sample data.

df <- data.frame(a = 1:5, b = 5:1)
df
#  a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1

Here are two options -

1. case_when:

For a single column -

library(dplyr)

df %>%
  mutate(a = case_when(a == 1 ~ 'a', 
                       a == 2 ~ 'b', 
                       a == 3 ~ 'c', 
                       a == 4 ~ 'd', 
                       a == 5 ~ 'e'))

#  a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1

For multiple columns -

df %>%
  mutate(across(c(a, b), ~case_when(. == 1 ~ 'a', 
                                    . == 2 ~ 'b', 
                                    . == 3 ~ 'c', 
                                    . == 4 ~ 'd', 
                                    . == 5 ~ 'e')))

#  a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a

2. dplyr::recode:

For a single column -

df %>%
  mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))

For multiple columns -

df %>%
  mutate(across(c(a, b), 
         ~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))
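
case_when is not limited to one-to-one mappings. As a sketch on the same sample data (the band names and the new column `a_band` are illustrative assumptions), conditions can cover ranges, and a final TRUE acts as the fallback:

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 5:1)

# Conditions are tried top to bottom and the first TRUE one wins;
# TRUE ~ ... is the catch-all for anything not matched above
out <- df %>%
  mutate(a_band = case_when(a <= 2 ~ "low",
                            a <= 4 ~ "mid",
                            TRUE   ~ "high"))
out
```

Because the first matching condition wins, the order of the conditions matters: `a <= 4` only applies to rows that did not already match `a <= 2`.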

Recoding variables with R
