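Convert a string of numbers into a unique, ordered string

Time: 2015-11-19 13:57:05

Tags: r

If a column in my data frame holds comma-separated strings of numbers, how can I convert each string into an ordered set of its unique values in another column?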

Month  String_of_Nums  Converted
May    3,3,2           2,3
June   3,3,3,1         1,3
Sept   3,3,3, 3        3
Oct    3,3,3, 4        3,4
Jan    3,3,4           3,4
Nov    3,3,5,5         3,5

I tried splitting the strings of numbers so that I could work on the unique values:

strsplit(df$String_of_Nums,",")

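But I end up with spaces in the resulting list of character vectors. Any ideas on how to generate the Converted column efficiently? I also still need to figure out how to apply the operation to every element of the column.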

2 answers:

Answer 0: (score: 2)

Try:

df1 <- read.table(text="Month  String_of_Nums
May    '3,3,2'           
June   '3,3,3,1'         
Sept   '3,3,3,3'        
Oct    '3,3,3,4'        
Jan    '3,3,4'           
Nov    '3,3,5,5'", header = TRUE)

df1$converted <- apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1, 
                       function(x) paste(sort(unique(x)), collapse = ","))

df1
  Month String_of_Nums converted
1   May          3,3,2       2,3
2  June        3,3,3,1       1,3
3  Sept        3,3,3,3         3
4   Oct        3,3,3,4       3,4
5   Jan          3,3,4       3,4
6   Nov        3,3,5,5       3,5
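
To see why this works, it can help to look at the intermediate table that read.csv() builds. The following is only a minimal sketch on two of the rows (the names nums, tab and converted are made up for illustration): the shorter row is padded with NA, and sort() drops NA by default, so the padding never reaches the result.

```r
# Two of the strings from the example data
nums <- c("3,3,2", "3,3,3,1")

# read.csv() parses each string as one row; the shorter row is padded with NA
tab <- read.csv(text = nums, header = FALSE)

# Row by row: keep unique values, sort them (sort() drops NA), paste back together
converted <- apply(tab, 1, function(x) paste(sort(unique(x)), collapse = ","))
converted
```

Because read.csv() parses the fields as numbers here, the values are sorted numerically rather than as text.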

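Answer 1: (score: 2)

I wanted to go in a different direction. As far as I can see, Jay's example has String_of_Nums as a factor. Given that you said strsplit() works, I assume you have String_of_Nums as character, so here I keep the column as character too. First, split each string (strsplit), find the unique elements (unique), sort them (sort), and paste them back together (toString). At this point you have a list. To turn the list into a vector, use as_vector from the purrr package. Out of curiosity, I also benchmarked the two ways of creating the vector (i.e., Converted).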

library(magrittr)
library(purrr)

lapply(strsplit(mydf$String_of_Nums, split = ","),
           function(x) toString(sort(unique(x)))) %>% 
as_vector(.type = "character") -> mydf$out

#  Month String_of_Nums  out
#1   May          3,3,2 2, 3
#2  June        3,3,3,1 1, 3
#3  Sept        3,3,3,3    3
#4   Oct        3,3,3,4 3, 4
#5   Jan          3,3,4 3, 4
#6   Nov        3,3,5,5 3, 5
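
Two details of the strsplit() route are worth keeping in mind: the pieces are character strings, so sort() orders them as text ("10" comes before "2"), and a space after a comma stays attached to its piece, so " 3" and "3" count as different values. A minimal base R sketch that trims each piece and converts it to a number before sorting might look like this (the helper to_sorted_unique and the sample vector x are hypothetical, not part of the answer above):

```r
# Sample input, including a stray space and a two-digit number
x <- c("3,3,2", "3,3,3, 4", "10,2,2")

# Hypothetical helper: split one string, trim, convert to numeric,
# then keep the unique values in increasing order
to_sorted_unique <- function(s) {
  parts <- strsplit(s, ",")[[1]]
  nums <- as.numeric(trimws(parts))
  paste(sort(unique(nums)), collapse = ",")
}

# Apply to every element; vapply() guarantees a character vector back
out <- vapply(x, to_sorted_unique, character(1), USE.NAMES = FALSE)
out
```

Using paste(..., collapse = ",") instead of toString() also keeps the result free of spaces, which matches the Converted column in the question.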


library(microbenchmark)
microbenchmark(
 jazz = lapply(strsplit(mydf$String_of_Nums, split = ","),
                   function(x) toString(sort(unique(x)))) %>% 
        as_vector(.type = "character"),

 jay = apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1, 
                   function(x) paste(sort(unique(x)), collapse = ",")),

 times = 10000)

# expr      min       lq      mean    median        uq      max neval
# jazz  358.913  393.018  431.7382  405.9395  420.1735 54779.29 10000
#  jay 1099.587 1151.244 1233.5631 1167.0920 1191.5610 56871.45 10000

Data

  Month String_of_Nums
1   May          3,3,2
2  June        3,3,3,1
3  Sept        3,3,3,3
4   Oct        3,3,3,4
5   Jan          3,3,4
6   Nov        3,3,5,5

mydf <- structure(list(Month = c("May", "June", "Sept", "Oct", "Jan", 
"Nov"), String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4", 
"3,3,4", "3,3,5,5")), .Names = c("Month", "String_of_Nums"), row.names = c(NA, 
-6L), class = "data.frame")