Reformatting a complex factor vector with comma thousands separators

Time: 2016-01-05 09:35:43

Tags: r string vector character gsub

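I would like to reformat a factor vector so that the numbers it contains have thousands separators. The vector contains both integers and real numbers, with no particular rule regarding the values or their order.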

Data

In particular, I am working with a vector vec similar to the one generated below:

content <- c("0 - 100", "0 - 100", "0 - 100", "0 - 100",
             "150.22 - 170.33",
             "1000 - 2000","1000 - 2000", "1000 - 2000", "1000 - 2000", 
             "7000 - 10000", "7000 - 10000", "7000 - 10000", "7000 - 10000",
             "7000 - 10000", "1000000 - 22000000", "1000000 - 22000000", 
             "1000000 - 22000000",
             "44000000 - 66000000.8989898989")

vec <- factor(x = content, levels = unique(content))

Desired results

My goal is to reformat this vector so that the numbers contain Excel-like 1,000 separators, as in the examples below:

  

  100.00
  1,000.00
  1,000,000.00
  1,000,000.56
  24,564,000,000.56

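Attempts

I was thinking of using gsubfn with a proto object that could be passed over the numbers, and then perhaps creating another proto object for groups of 3 digits and replacing those, as in the code below: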

library(gsubfn)
# Appends a comma to every run of three digits, wherever it occurs
gsubfn(pattern = "[0-9][0-9][0-9]", replacement = ~paste0(x, ','), 
       x = as.character(vec))

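This works only partly, because the commas are inserted in the wrong places:

"150,.22 - 170,.33"

which is obviously wrong. I would also have to convert the resulting character vector back to a factor. My question therefore boils down to two elements: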

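  • How can I work around the comma problem?
  • How can I maintain the original structure of the factor? I need the factor vector to be ordered in the same way as the original factor, but with commas in the right places.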

3 answers:

Answer 0: (score: 1)

Using a positive lookahead regular expression...

content <- c("0 - 100", "0 - 100", "0 - 100", "0 - 100",
              "1000 - 2000","1000 - 2000", "1000 - 2000", "1000 - 2000", 
              "7000 - 10000", "7000 - 10000", "7000 - 10000", "7000 - 10000",
              "7000 - 10000", "1000000 - 22000000", "1000000 - 22000000", 
              "1000000 - 22000000")
gsub("(\\d)(?=(?:\\d{3})+\\b)", "\\1,", content, perl = TRUE)
# [1] "0 - 100"                "0 - 100"                "0 - 100"               
# [4] "0 - 100"                "1,000 - 2,000"          "1,000 - 2,000"         
# [7] "1,000 - 2,000"          "1,000 - 2,000"          "7,000 - 10,000"        
# [10] "7,000 - 10,000"         "7,000 - 10,000"         "7,000 - 10,000"        
# [13] "7,000 - 10,000"         "1,000,000 - 22,000,000" "1,000,000 - 22,000,000"
# [16] "1,000,000 - 22,000,000"
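Note that the content vector in this answer leaves out the decimal ranges from the question. Applied to a value such as "66000000.8989898989", the pattern above would also insert commas into the digits after the decimal point. A minimal sketch of one possible workaround follows. It assumes perl = TRUE, so that the PCRE backtracking verbs (*SKIP)(*FAIL) are available, and it uses a shortened version of the question's data. It is one way to extend the pattern, not part of the original answer.

```r
# Shortened version of the question's data, including the decimal cases.
content <- c("0 - 100", "150.22 - 170.33", "1000 - 2000",
             "1000000 - 22000000", "44000000 - 66000000.8989898989")

# First alternative: consume a decimal point plus its digits, then discard
# that text with (*SKIP)(*FAIL) so it is never offered to the second
# alternative.
# Second alternative: the lookahead pattern from the answer above.
pattern <- "\\.\\d+(*SKIP)(*FAIL)|(\\d)(?=(?:\\d{3})+\\b)"

formatted <- gsub(pattern, "\\1,", content, perl = TRUE)
formatted
```

The result is still a character vector, so it would have to be turned back into a factor (or applied to levels(vec) only) to keep the original level order.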

Answer 1: (score: 1)

Perhaps you could use formatC:

sapply(
  X = lapply(
    X = strsplit(x = content, split = " - "),
    FUN = function(x) {
      formatC(x = as.numeric(x), format = "f", flag = "#", big.mark = ",", 
              decimal.mark = ".", digits = 2, drop0trailing = FALSE)
    }
  ),
  FUN = paste, collapse = " - "
)
# [1] "0.00 - 100.00"                 "0.00 - 100.00"                 "0.00 - 100.00"                
# [4] "0.00 - 100.00"                 "150.22 - 170.33"               "1,000.00 - 2,000.00"          
# [7] "1,000.00 - 2,000.00"           "1,000.00 - 2,000.00"           "1,000.00 - 2,000.00"          
# [10] "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"         
# [13] "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"          "1,000,000.00 - 22,000,000.00" 
# [16] "1,000,000.00 - 22,000,000.00"  "1,000,000.00 - 22,000,000.00"  "44,000,000.00 - 66,000,000.90"
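The call above returns a character vector, which does not address the second part of the question (keeping the factor). As a rough sketch, the same formatC formatting could be applied to the levels of the factor instead. The helper name format_range is hypothetical, and the data is a shortened version of the question's vector.

```r
# Shortened version of the question's data.
content <- c("0 - 100", "150.22 - 170.33", "1000 - 2000", "1000 - 2000",
             "44000000 - 66000000.8989898989")
vec <- factor(content, levels = unique(content))

# Hypothetical helper: format both ends of one "a - b" range with formatC.
format_range <- function(range_label) {
  bounds <- as.numeric(strsplit(range_label, " - ", fixed = TRUE)[[1]])
  paste(formatC(bounds, format = "f", digits = 2, big.mark = ","),
        collapse = " - ")
}

# Relabel only the levels; the factor's integer codes are left alone.
levels(vec) <- vapply(levels(vec), format_range, character(1),
                      USE.NAMES = FALSE)
levels(vec)
```

Because the values pass through as.numeric and are rounded to two decimals, long fractions such as .8989898989 are shortened to .90, as in the output above.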

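Answer 2: (score: 1)

Operating only on the levels seems to keep your precision and avoids converting your vector to a character vector. It should also be more efficient, because it reduces the amount of data you operate on to the unique values only (instead of the whole vector).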

levels(vec) <- sapply(strsplit(levels(vec), " - "), 
                       function(x) paste(prettyNum(x, 
                                            big.mark = ",", 
                                            preserve.width = "none"), 
                                   collapse = " - "))
vec
#  [1] 0 - 100                            0 - 100                            0 - 100                            0 - 100                            150.22 - 170.33                   
#  [6] 1,000 - 2,000                      1,000 - 2,000                      1,000 - 2,000                      1,000 - 2,000                      7,000 - 10,000                    
# [11] 7,000 - 10,000                     7,000 - 10,000                     7,000 - 10,000                     7,000 - 10,000                     1,000,000 - 22,000,000            
# [16] 1,000,000 - 22,000,000             1,000,000 - 22,000,000             44,000,000 - 66,000,000.8989898989
# Levels: 0 - 100 150.22 - 170.33 1,000 - 2,000 7,000 - 10,000 1,000,000 - 22,000,000 44,000,000 - 66,000,000.8989898989
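To check that this relabeling really keeps the original structure of the factor, one option is to compare the integer codes before and after. The sketch below uses a small, deliberately unsorted sample rather than the full data. The variable codes_before is introduced here for illustration only.

```r
# Small sample whose level order is not numeric order.
content <- c("0 - 100", "1000 - 2000", "1000 - 2000", "150.22 - 170.33",
             "7000 - 10000")
vec <- factor(content, levels = unique(content))
codes_before <- as.integer(vec)  # which level each element points to

# Same relabeling as in the answer above.
levels(vec) <- sapply(strsplit(levels(vec), " - "),
                      function(x) paste(prettyNum(x, big.mark = ",",
                                                  preserve.width = "none"),
                                        collapse = " - "))

# Identical codes and level positions mean that sorting and tabulating
# behave exactly as they did before the relabeling.
identical(codes_before, as.integer(vec))
levels(vec)
```

If the comparison returns TRUE, only the labels have changed, so sort(vec) and table(vec) follow the same order as for the original factor.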