Change column type based on a condition on the column's values

Time: 2017-11-10 12:40:32

Tags: r dplyr tidyverse tibble

I have data; here is a small sample:

df <- structure(list(`d955` = c("1", "4", NA, NA), 
                `65c2` = c("6a08", NA, "6a08", "6a09")), 
                 class = c("tbl_df", "tbl", "data.frame"), 
                 row.names = c(NA, -4L), .Names = c("d955", "65c2"))
# A tibble: 4 x 2
#    d955 `65c2`
#   <chr>  <chr>
# 1     1   6a08
# 2     4   <NA>
# 3  <NA>   6a08
# 4  <NA>   6a09

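Both columns are of type character. I want to change the column type to integer for every column that contains only the numbers 1 to 5. I know I could pick the columns by hand, but the columns keep changing, so that is not a satisfying option.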

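So how can this be done automatically? I have been looking at mutate_if from the dplyr package, but I don't know how to select the right columns to begin with.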

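I have been looking into str_detect, which might work, but something like str_detect(df, "[1234]") would also match strings in the 65c2 column that happen to contain a digit between 1 and 4. I have also been looking for a solution with str_count, since the integers always have a count of 1, but I couldn't find a good way to select columns based on a string-count condition...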
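To illustrate the matching problem, here is a minimal sketch in base R. The vector x and the value "6a14" are made up for the illustration and are not part of the sample data. Anchoring the pattern with ^ and $ restricts the match to strings that consist of exactly one digit from 1 to 5:

```r
# Hypothetical values; "6a14" stands in for an ID that contains the digits 1 and 4
x <- c("1", "4", NA, "6a14")

# Unanchored: TRUE for any string that merely contains one of the digits 1-4
unanchored <- grepl("[1234]", x)

# Anchored: TRUE only when the whole string is a single digit from 1 to 5
anchored <- grepl("^[1-5]$", x)

print(unanchored)
print(anchored)
```

Note that grepl returns FALSE for NA, so missing values would need to be handled separately when deciding whether a whole column qualifies.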

Desired result, produced automatically:

# A tibble: 4 x 2
#    d955 `65c2`
#   <int>  <chr>
# 1     1   6a08
# 2     4   <NA>
# 3  <NA>   6a08
# 4  <NA>   6a09

3 answers:

Answer 0 (score: 3)

An idea via base R,

i1 <- colSums(sapply(df, function(i) i %in% c(NA, 1:5))) == nrow(df)
df[i1] <- lapply(df[i1], as.integer)

which gives,

str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4 obs. of  2 variables:
 $ d955: int  1 4 NA NA
 $ 65c2: chr  "6a08" NA "6a08" "6a09"

You can also turn it into a function,

my_conversion <- function(df){
  i1 <- colSums(sapply(df, function(i) i %in% c(NA, 1:5))) == nrow(df)
  df[i1] <- lapply(df[i1], as.integer)
  return(df)
}
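As a usage sketch, the function can be applied to the sample data like this. The data is rebuilt here as a plain data.frame (an assumption made so the example runs without extra packages), and converted is a hypothetical variable name:

```r
# Function from the answer above, repeated so the example is self-contained
my_conversion <- function(df){
  i1 <- colSums(sapply(df, function(i) i %in% c(NA, 1:5))) == nrow(df)
  df[i1] <- lapply(df[i1], as.integer)
  return(df)
}

# Sample data as a plain data.frame; check.names = FALSE keeps the name `65c2`
df <- data.frame(d955 = c("1", "4", NA, NA),
                 `65c2` = c("6a08", NA, "6a08", "6a09"),
                 check.names = FALSE, stringsAsFactors = FALSE)

converted <- my_conversion(df)
print(sapply(converted, class))
```

Because NA is included in the lookup set, a column consisting entirely of NA would also be converted to integer by this approach.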

Answer 1 (score: 3)

A solution using mutate_if from the dplyr package. We need to define a predicate function (is_one_five_only) for this task.

library(dplyr)

# Design a function to determine if elements from one vector are all 1 to 5
# Notice that if the entire column is NA, it will report FALSE
is_one_five_only <- function(x){
  if (all(is.na(x))){
    return(FALSE)
  } else {
    x2 <- x[!is.na(x)]
    return(all(x2 %in% 1:5))
  }
}

# Apply is_one_five_only as the predicate function in mutate_if
df2 <- df %>% mutate_if(is_one_five_only, as.integer)
df2

# # A tibble: 4 x 2
#   d955 `65c2`
#   <int>  <chr>
# 1     1   6a08
# 2     4   <NA>
# 3    NA   6a08
# 4    NA   6a09
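In dplyr 1.0.0 and later, the same idea can be written with across() and where(), which supersede mutate_if. This is only a sketch under that version assumption; the predicate is a condensed form of the one above, and df3 is a hypothetical name:

```r
library(dplyr)

# Condensed predicate: FALSE for all-NA columns, otherwise TRUE only when
# every non-missing value is one of 1 to 5
is_one_five_only <- function(x) {
  x2 <- x[!is.na(x)]
  length(x2) > 0 && all(x2 %in% 1:5)
}

# Sample data rebuilt as a tibble
df <- tibble(d955 = c("1", "4", NA, NA),
             `65c2` = c("6a08", NA, "6a08", "6a09"))

# Convert every column for which the predicate returns TRUE
df3 <- df %>% mutate(across(where(is_one_five_only), as.integer))
print(df3)
```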

Answer 2 (score: 1)

Using data.table

library(data.table)
setDT(df)

# get indices of all the character columns
# (i.e. we can skip numeric/other columns)
char_cols = which(sapply(df, is.character))

# := is the assignment operator in data.table --
#  since data.table is built for efficiency,
#  this differs from base R or dplyr assignment
#  since assignment with := is _by reference_,
#  meaning no copies are created. there are other
#  advantages of :=, like simple assignment
#  by group -- see the intro vignettes
# .SD is a reflexive reference -- if .SDcols
#  is unspecified, it simply refers to your
#  data.table itself -- df[ , .SD] is the same as df.
#  .SDcols is used to restrict which columns are
#  included in this Subset of the Data -- here,
#  we only include character columns.
# Finally, by lapply-ing .SD, we essentially loop
#  over the specified columns to apply our
#  custom-tailored function
# Note: := belongs in j (after the first comma), and its
#  left-hand side must be column names or integer positions
df[ , (char_cols) := lapply(.SD, function(x) {
  if (any(grepl('[^1-5]', x))) x
  else as.integer(x)
}), .SDcols = char_cols]

Hopefully the conversion logic is clear; I can elaborate if needed.
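For readers who want to see the per-column check in isolation, here is a minimal base R sketch. The helper name convert_if_one_to_five and the value "15" are hypothetical and only serve the illustration:

```r
# Same test as in the lapply call above, wrapped in a hypothetical helper:
# keep the vector as is when any element contains a character outside 1-5
convert_if_one_to_five <- function(x) {
  if (any(grepl("[^1-5]", x))) x else as.integer(x)
}

a <- convert_if_one_to_five(c("1", "4", NA, NA))             # converted
b <- convert_if_one_to_five(c("6a08", NA, "6a08", "6a09"))   # left unchanged
d <- convert_if_one_to_five(c("15", "2"))                    # also converted

print(class(a))
print(class(b))
print(d)
```

The pattern only checks which characters occur, not how many, so a multi-digit string such as "15" also passes and is converted to the integer 15. If only single digits should qualify, an anchored pattern would be a stricter alternative.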

See the Getting Started wiki for primers and plenty of other resources to get yourself acquainted with the essentials of data.table.