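If a column in my data frame holds strings of comma-separated numbers, how can I convert each string into a sorted set of unique values in another column?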
Month String_of_Nums Converted
May 3,3,2 2,3
June 3,3,3,1 1,3
Sept 3,3,3, 3 3
Oct 3,3,3, 4 3,4
Jan 3,3,4 3,4
Nov 3,3,5,5 3,5
I tried splitting the string of numbers so that unique could work on the pieces:
strsplit(df$String_of_Nums,",")
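But I end up with spaces in the resulting list of characters. Any ideas on how to generate the Converted column efficiently? I also need to figure out how to apply the operation to every element of the column.
Answer 0: (score: 2)
Try: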
df1 <- read.table(text="Month String_of_Nums
May '3,3,2'
June '3,3,3,1'
Sept '3,3,3,3'
Oct '3,3,3,4'
Jan '3,3,4'
Nov '3,3,5,5'", header = TRUE)
df1$converted <- apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ","))
df1
Month String_of_Nums converted
1 May 3,3,2 2,3
2 June 3,3,3,1 1,3
3 Sept 3,3,3,3 3
4 Oct 3,3,3,4 3,4
5 Jan 3,3,4 3,4
6 Nov 3,3,5,5 3,5
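To see why this works, it can help to look at the intermediate table that read.csv builds. The following is a minimal sketch, assuming the same six strings as above; the names nums, parsed and converted are only for illustration. Shorter rows are padded with NA, and sort drops NA by default, so the padding never reaches the pasted result.

```r
# The same six strings as in df1$String_of_Nums, typed as a character vector.
nums <- c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4", "3,3,4", "3,3,5,5")

# read.csv() treats each element as one line and pads short rows with NA.
parsed <- read.csv(text = nums, header = FALSE)
dim(parsed)      # number of rows and columns after padding
parsed[1, ]      # the first row has only three values, so its last cell is NA

# sort() silently drops NA, so only the real values are pasted back together.
converted <- apply(parsed, 1, function(x) paste(sort(unique(x)), collapse = ","))
converted
```

Note that this relies on every piece parsing as a number; the as.character call in the answer matters only when the column was read in as a factor.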
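Answer 1: (score: 2)
I would like to take a different approach. As far as I can see, Jay's example has String_of_Nums as a factor. Given that you say strsplit() works, I assume you have String_of_Nums as character, so here I keep the column as character too. First, split each string (strsplit), find the unique values (unique), sort them (sort), and paste them back together (toString). At this point you have a list. You then want to convert the list to a vector using as_vector from the purrr package. Out of curiosity, I also ran a benchmark to compare the performance of the two ways of creating the vector (i.e., Converted).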
library(magrittr)
library(purrr)
lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character") -> mydf$out
# Month String_of_Nums out
#1 May 3,3,2 2, 3
#2 June 3,3,3,1 1, 3
#3 Sept 3,3,3,3 3
#4 Oct 3,3,3,4 3, 4
#5 Jan 3,3,4 3, 4
#6 Nov 3,3,5,5 3, 5
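The same split, unique, sort, paste pipeline can also be written in base R alone. The sketch below is a variant under stated assumptions, not part of the original answer: the helper name sort_unique_nums is hypothetical, it trims stray spaces (as in the question's "3,3,3, 4"), and it converts the pieces to numeric before sorting so that 10 sorts after 2, which sorting the pieces as character strings does not guarantee.

```r
# Hypothetical helper: reduce each comma-separated string to its sorted unique numbers.
sort_unique_nums <- function(x) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)        # drop spaces around each piece
    p <- p[nzchar(p)]     # drop pieces that are empty after trimming, e.g. from "3,3,3, "
    paste(sort(unique(as.numeric(p))), collapse = ",")
  }, character(1))
}

sort_unique_nums(c("3,3,2", "3,3,3, 4", "10,2,2"))
```

With the data below, this could be used as mydf$out <- sort_unique_nums(mydf$String_of_Nums). vapply returns a character vector directly, so no list-to-vector conversion step is needed.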
library(microbenchmark)
microbenchmark(
jazz = lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character"),
jay = apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ",")),
times = 10000)
# expr min lq mean median uq max neval
# jazz 358.913 393.018 431.7382 405.9395 420.1735 54779.29 10000
# jay 1099.587 1151.244 1233.5631 1167.0920 1191.5610 56871.45 10000
Data
Month String_of_Nums
1 May 3,3,2
2 June 3,3,3,1
3 Sept 3,3,3,3
4 Oct 3,3,3,4
5 Jan 3,3,4
6 Nov 3,3,5,5
mydf <- structure(list(Month = c("May", "June", "Sept", "Oct", "Jan",
"Nov"), String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4",
"3,3,4", "3,3,5,5")), .Names = c("Month", "String_of_Nums"), row.names = c(NA,
-6L), class = "data.frame")