Question

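I have a very large dataset in which some columns are formatted as currency, some are numeric, and some are character. When the data is read in, all of the currency columns are identified as factors, and I need to convert them to numeric. The dataset is too wide to identify the columns manually. I am trying to find a programmatic way to determine whether a column contains currency data (e.g. its values start with "$") and then pass that list of columns on to be cleaned.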

name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)

clean <- function(ttt){
  # Drop everything except letters, digits and the decimal point, then convert
  as.numeric(gsub('[^a-zA-Z0-9.]','', ttt))
}
sapply(emp_data, clean)

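The problem in this example is that the function is applied to every column, so the name column is replaced with NA. I need a way to programmatically identify only the columns that the clean function should be applied to.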

Answer 1

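Using the dplyr and stringr packages, you can use mutate_if to identify the columns that contain any string starting with $ and then convert only those columns.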

library(dplyr)
library(stringr)

emp_data %>%
  mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
            ~as.numeric(str_replace_all(., '[$,]', '')))

Answer 2

Take advantage of the powerful parsers provided by the readr package:

my_parser <- function(col) {
  # Try first with parse_number that handles currencies automatically quite well
  res <- suppressWarnings(readr::parse_number(col))
  if (is.null(attr(res, "problems", exact = TRUE))) {
    res
  } else {
    # If parse_number fails, fall back on parse_guess
    readr::parse_guess(col)
    # Alternatively, we could simply return col without further parsing attempt
  }
}

library(dplyr)

emp_data %>% 
  mutate(foo = "USD13.4",
         bar = "£37") %>% 
  mutate_all(my_parser)

#   name   salary  foo bar
# 1 john 23456.33 13.4  37
# 2 carl 45677.43 13.4  37
# 3 hank 76234.88 13.4  37

Answer 3

A base R option is to use startsWith to detect the dollar columns and gsub to remove "$" and "," from them.

doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols], 
                              function(x) as.numeric(gsub('\\$|,', '', x)))
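
As a rough end-to-end check of the base R approach, the sketch below rebuilds the question's data with stringsAsFactors = TRUE (an assumption, to mimic the factor columns described in the question) and wraps the two steps in helpers. The names is_dollar_col and clean_dollars are hypothetical, introduced only for illustration, and na.rm = TRUE is added on the assumption that real data may contain missing values.

```r
# Recreate the question's data; stringsAsFactors = TRUE is an assumption
# that mimics the factor columns described in the question
emp_data <- data.frame(
  name   = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: TRUE if any non-missing value starts with "$"
is_dollar_col <- function(x) {
  any(startsWith(as.character(x), "$"), na.rm = TRUE)
}

# Hypothetical helper: strip "$" and "," and convert to numeric
clean_dollars <- function(x) {
  as.numeric(gsub("[$,]", "", as.character(x)))
}

# One logical flag per column, then clean only the flagged columns
doll_cols <- vapply(emp_data, is_dollar_col, logical(1))
emp_data[doll_cols] <- lapply(emp_data[doll_cols], clean_dollars)

str(emp_data)
```

Only salary is flagged here, so name is left untouched instead of being turned into NA. vapply is used rather than sapply so that the result is always a logical vector, even for a data frame with no columns.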

R - Determine which columns contain currency data ($)
