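Identify and strip characters from a column

Time: 2021-06-23 09:30:29

Tags: r character symbols gsub remove

I have a large dataset in which I want to identify and remove letters and symbols so that only the numeric values remain. For example, I want -£1125.91m to become -1125.91.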

dataset
  Event                       var1       var2  
  <fct>                       <chr>      <chr> 
1 Labour Costs YoY            13.34m     0.026 
2 Unemployment Change (000's) $16.91b    -0.449
3 Unemployment Rate           -£1125.91m 0.89k 
4 Jobseekers Net Change       ¥1012.74b  9.56m

So far I only know how to remove a single character from a column, like this:

dataset$`var1` <- gsub("k", "", dataset$`var1`)

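Doing this by hand for every character would take a lot of work because the dataset is very large. Is there a way to identify and remove all letters at once, including the currency symbols and the m and b suffixes?

To reproduce the dataset: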

dataset <- structure(list(
  Event = structure(2:5,
    .Label = c("Event", "Labour Costs YoY", "Unemployment Change (000's)",
               "Unemployment Rate", "Jobseekers Net Change"),
    .Names = c("", "", "", ""), class = "factor"),
  var1 = c("13.34m", "$16.91b", "-£1125.91m", "¥1012.74b"),
  var2 = c(0.026, -0.449, "0.89k", "9.56m")),
  row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

Thanks in advance!

1 answer:

Answer 0: (score: 1)

To remove everything except hyphens, digits, and dots, you can use

dataset$var1 <- gsub("[^-0-9.]", "", dataset$var1)

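The [^-0-9.] pattern is a negated character class: it matches any character other than the ones listed in the class (here a hyphen, the digits 0 to 9, and a dot), and gsub replaces every such match with an empty string.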
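As a minimal sketch of what the class keeps and drops, the snippet below runs the same pattern on a few made-up strings (the vector `x` and the "n/a" entry are illustrative assumptions, not part of the dataset) and then converts the result with `as.numeric`. The pound sign is written as the escape `\u00a3` only to keep the snippet ASCII-safe.

```r
# Hypothetical sample values; "\u00a3" is the pound sign
x <- c("-\u00a31125.91m", "0.89k", "$16.91b", "n/a")

# Delete every character that is not "-", a digit, or "."
cleaned <- gsub("[^-0-9.]", "", x)
print(cleaned)

# Convert the cleaned strings to numbers; a string with no numeric
# characters left (here "n/a", which becomes "") turns into NA
values <- as.numeric(cleaned)
print(values)
```

Note that the class keeps a hyphen or dot wherever it occurs, not only a leading minus sign or a decimal point, so strings such as "1-2m" would come out as "1-2" and would not convert cleanly to a number.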

See the regex demo online and the online R demo:

dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY", 
    "Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"), 
   .Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", "$16.91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA, 
   -4L), class = c("tbl_df", "tbl", "data.frame"))
gsub("[^-0-9.]", "", dataset$var1)
##  => [1] "13.34"    "16.91"    "-1125.91" "1012.74"
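Because the question mentions a large dataset with several such columns, the same pattern can be applied to more than one column in a single step. The following is only a sketch in base R: the data frame `df` is a plain stand-in for the tibble above, and the `cols` vector is a hypothetical list of columns to clean.

```r
# Stand-in for the dataset above ("\u00a3" = pound sign, "\u00a5" = yen sign)
df <- data.frame(
  var1 = c("13.34m", "$16.91b", "-\u00a31125.91m", "\u00a51012.74b"),
  var2 = c("0.026", "-0.449", "0.89k", "9.56m"),
  stringsAsFactors = FALSE
)

# Columns to clean; adjust to the real column names
cols <- c("var1", "var2")

# Strip everything except "-", digits, and ".", then convert to numeric
df[cols] <- lapply(df[cols], function(x) {
  as.numeric(gsub("[^-0-9.]", "", x))
})

str(df)
```

This discards the k, m, and b suffixes entirely, so "0.89k" and "9.56m" become 0.89 and 9.56; if the suffixes encode magnitudes, they would need to be handled before stripping.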