Question

I want to generate a vector that holds the total count of each of the 26 letters of the alphabet as they occur in the vector a.

a <- c("aabead", "dadfhhsa")

For example, in this vector a would equal 5, b would be 1, d would be 3, z would be 0, x would be 0, and so on.

Answer 1

You just need strsplit and unlist, with the help of table:

table(unlist(strsplit(a, ""), use.names=FALSE))
#
# a b d e f h s 
# 5 1 3 1 1 2 1

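strsplit "explodes" the strings into individual letters. It returns a list with one item for each string in the vector "a".
Since the output of strsplit is a list, you need to unlist it before tabulating. use.names = FALSE just makes unlist faster.
table, as you might have guessed by now, tabulates the result.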
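The three steps can be run one at a time to see what each function contributes. This is only an illustrative sketch on the question's sample data; the intermediate names pieces, chars and counts are chosen for this example and do not appear in the answer.

```r
a <- c("aabead", "dadfhhsa")

# Step 1: strsplit() returns a list with one character vector per input string.
pieces <- strsplit(a, "")
str(pieces)

# Step 2: unlist() flattens that list into a single character vector of letters.
chars <- unlist(pieces, use.names = FALSE)

# Step 3: table() counts how often each distinct letter occurs.
counts <- table(chars)
counts
```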

If you really want the zero values, you need to use factor with the help of the built-in letters constant:

table(factor(unlist(strsplit(a, ""), use.names=FALSE), levels=letters))
#
# a b c d e f g h i j k l m n o p q r s t u v w x y z 
# 5 1 0 3 1 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
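Since the question asks for a vector rather than a table, one possible follow-up is to turn the result into a plain named integer vector. The sketch below assumes lowercase input only; count_vec is a name made up for this example.

```r
a <- c("aabead", "dadfhhsa")

chars  <- unlist(strsplit(a, ""), use.names = FALSE)
counts <- table(factor(chars, levels = letters))

# Drop the table class but keep the letter names.
# `count_vec` is a hypothetical name used only in this example.
count_vec <- setNames(as.integer(counts), names(counts))

count_vec[["a"]]        # count for a single letter
count_vec[c("d", "z")]  # counts for several letters, including an absent one
```

Characters that are not in letters (uppercase letters, digits, punctuation) become NA when the factor is built and are dropped by table by default, so they do not appear in the result.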

Update

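When dealing with these types of problems, where you have to iterate across a lot of values, it is important to consider how you are approaching the problem.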

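In the accepted answer, for instance, unlist(strsplit(...)) is called 26 times: once for each letter. You will find a significant performance boost by first splitting and unlisting the values, and then using sapply. Compare the difference in performance of fun1a and fun1b below.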

For reference, I have also benchmarked the factor-based solution and an alternative using tabulate. As can be seen, these are much faster than looping through the individual letters using sapply.

library(stringi)
set.seed(1)
n <- 100000
a <- stri_rand_strings(n, sample(10, n, TRUE), "[a-z]")

fun1a <- function() sapply(letters, function(x) x<-sum(x==unlist(strsplit(a,""))))
fun1b <- function() {
  temp <- unlist(strsplit(a, ""))
  sapply(letters, function(x) {
    sum(x == temp)
  })
}
fun2 <- function() table(factor(unlist(strsplit(a, "", TRUE), use.names=FALSE), levels=letters))
fun3 <- function() {
  `names<-`(tabulate(
    factor(unlist(strsplit(a, "", TRUE), use.names = FALSE), letters),
    nbins = 26), letters)
}

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), fun3(), times = 10)
# Unit: milliseconds
#     expr        min         lq       mean     median         uq        max neval
#  fun1a() 1025.45449 1177.90226 1189.49551 1190.11137 1238.66071 1352.05645    10
#  fun1b()  102.94881  114.08700  115.14852  115.87184  119.06776  124.64735    10
#   fun2()   53.46341   58.67832   67.50402   68.94933   70.71005   95.10771    10
#   fun3()   46.65357   49.79365   51.68536   51.55922   54.36390   57.07582    10

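To make the tabulate-based variant easier to follow, here is a small step-by-step sketch on the question's sample data. It is a simplified illustration rather than the benchmarked function itself; the names chars, codes, counts and via_table are assumptions made for this example.

```r
a <- c("aabead", "dadfhhsa")
chars <- unlist(strsplit(a, "", fixed = TRUE), use.names = FALSE)

# factor() with levels = letters stores each character as an integer code:
# "a" -> 1, "b" -> 2, ..., "z" -> 26.
codes <- as.integer(factor(chars, levels = letters))

# tabulate() counts how often each code from 1 to nbins occurs,
# so letters that never appear come out as 0.
counts <- tabulate(codes, nbins = 26)
names(counts) <- letters

# The numbers agree with the table()-based version.
via_table <- table(factor(chars, levels = letters))
all(counts == as.integer(via_table))
```

tabulate works directly on the integer codes and returns a bare integer vector, which is presumably why it comes out slightly ahead of table in the timings above.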

Answer 2

You can use the built-in R vector letters

and do it this way:

 > sapply(letters, function(x) x<-sum(x==unlist(strsplit(a,""))))
a b c d e f g h i j k l m n o p q r s t u v w x y z 
5 1 0 3 1 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Count the number of each letter in a vector of strings
