Question

I have a dataset like this, read into R from a file with read.table():

Nr Result1
 1 "A203,A305,A409,B309,B424,B545"
 2 "A190,A203,A305,B309,B425,B545"
 3 "A203,A305,A410,B280,B309,B425,B545"

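Result1 is a character string. I would like to split it at the "," and count the occurrences of every element across the rows. I want to count the distinct elements and write out the result in this format: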

A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
1    3    3    1    1    1    3    1    2    3

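My first idea was to loop over each row, split the string into its single elements, create a vector from the first set of elements and then, for the second row, check whether a matching element already exists (count + 1) or append the unknown element with count = 1.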

I am new to R and would appreciate some example code, or hints on how to implement the individual steps with R functions! Thanks a lot.

Answer 1

I think this is what you are looking for:

newvector <- table(unlist(strsplit(as.character(df$Result1), ",")))

The result (stored in newvector):

#>newvector
#A190 A203 A305 A409 A410 B280 B309 B424 B425 B545 
#   1    3    3    1    1    1    3    1    2    3

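The strsplit function splits the character vector (Result1) at each comma. The result is a list with one element per row of the data.frame (df in my example). To convert this list to a vector, use unlist. The table function then creates a table of the frequencies.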
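To make the three steps easier to follow, here is a minimal sketch that runs them one at a time. The data frame df is rebuilt from the sample in the question, and the intermediate names pieces and flat are only illustrative; stringsAsFactors = FALSE is an assumption about how the data was read in.

```r
# Sample data from the question, rebuilt by hand (assumed layout).
df <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Step 1: split every string at the commas; the result is a list
# with one character vector per row of df.
pieces <- strsplit(as.character(df$Result1), ",", fixed = TRUE)

# Step 2: flatten the list into a single character vector.
flat <- unlist(pieces)

# Step 3: count how often each distinct element occurs.
newvector <- table(flat)
print(newvector)
```

The as.character() call is kept so that the sketch also works if Result1 was read in as a factor.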

Answer 2

Here is another alternative to consider:

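The development version of my "splitstackshape" package has a function named concat.split.expanded that "expands" the values into a binary representation at their relevant column positions. You can then drop the first column and take the column sums (using colSums).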

Here is some sample data:

mydf <- data.frame(Nr = 1:3,
                   Result1 = c("A190,A203,A305,B309,B425,B545", 
                               "A203,A305,A409,B309,B424,B545", 
                               "A203,A305,A410,B280,B309,B425,B545"))

Here is the implementation I would suggest:

## library(devtools)
## install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape) ## Install "devel" version from github
colSums(concat.split.expanded(mydf, "Result1", ",", type="character", 
                              fill=0, drop = TRUE)[-1])
# Result1_A190 Result1_A203 Result1_A305 Result1_A409 Result1_A410 Result1_B280 
#            1            3            3            1            1            1 
# Result1_B309 Result1_B424 Result1_B425 Result1_B545 
#            3            1            2            3
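If installing the development version of the package is not an option, the same "expand to a 0/1 matrix, then take column sums" idea can be sketched in base R. This is only an illustration of the idea and not how concat.split.expanded is implemented; the names parts, elements and indicator are made up for this sketch.

```r
# Same sample data as above (stringsAsFactors = FALSE is an assumption).
mydf <- data.frame(
  Nr = 1:3,
  Result1 = c("A190,A203,A305,B309,B425,B545",
              "A203,A305,A409,B309,B424,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Split each row and collect the distinct elements as column names.
parts <- strsplit(mydf$Result1, ",", fixed = TRUE)
elements <- sort(unique(unlist(parts)))

# One row per input row, one column per distinct element:
# 1 if the element is present in that row, 0 otherwise.
indicator <- t(vapply(parts,
                      function(x) as.integer(elements %in% x),
                      integer(length(elements))))
colnames(indicator) <- elements

# Column sums give the number of rows in which each element appears.
colSums(indicator)
```

Because the matrix only records presence, an element repeated within a single row would be counted once for that row. That makes no difference for this sample, where no row repeats an element.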

In the long run, this will actually be faster than table:

mydf <- do.call(rbind, replicate(30000, mydf, FALSE))
dim(mydf)
# [1] 90000     2

fun1 <- function() {
  colSums(concat.split.expanded(mydf, "Result1", ",", 
                                type="character", fill=0, drop = TRUE)[-1])
} 
fun2 <- function() table(unlist(strsplit(as.character(mydf$Result1), ",")))
system.time(out1 <- fun1())
#    user  system elapsed 
#    0.77    0.00    0.76 
system.time(out2 <- fun2())
#    user  system elapsed 
#    6.10    0.00    6.09 
all.equal(as.vector(out1), as.vector(out2))
# [1] TRUE

Answer 3

Use table(unlist(strsplit(yourstring, ","))). For example, if your string is example = "A203,A305,A409,B309,B424,B545", the code below will produce:

R:> example = "A203,A305,A409,B309,B424,B545"
R:> table(unlist(strsplit(example, ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1 

A more complete example:

R:> ex <- data.frame(rbind(c(1, "A203,A305,A409,B309,B424,B545"), 
                           c(2, "A190,A203,A305,B309,B425,B545"),  
                           c(3, "A203,A305,A410,B280,B309,B425,B545")))
R:> ex
  X1                                 X2
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> names(ex) <- c("Nr", "Result1")
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545

R:> typeof(ex$Nr)
[1] "integer"
R:> typeof(ex$Result1)
[1] "integer"
R:> ex$Result1 <- as.character(ex$Result1)
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> table(unlist(strsplit(ex$Result1[1], ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1

Use sapply() to process the whole column (Result1) in one go. See below.

R:> sapply(ex$Result1, function(x) {table(unlist(strsplit(x, ",")))})
$`A203,A305,A409,B309,B424,B545`

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1 

$`A190,A203,A305,B309,B425,B545`

A190 A203 A305 B309 B425 B545 
   1    1    1    1    1    1 

$`A203,A305,A410,B280,B309,B425,B545`

A203 A305 A410 B280 B309 B425 B545 
   1    1    1    1    1    1    1
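The sapply() call above returns one table per row, so the per-row counts still have to be combined to get the overall counts the question asks for. One possible way, sketched under the assumption that Result1 is a character column: give every row the same set of factor levels so that the per-row counts line up in a matrix, then sum across the rows. The names parts, all_levels, per_row and totals are only illustrative.

```r
# Sample data from the question (stringsAsFactors = FALSE is an assumption).
ex <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

parts <- strsplit(ex$Result1, ",", fixed = TRUE)

# All distinct elements, used as a shared set of factor levels.
all_levels <- sort(unique(unlist(parts)))

# Count each row against the shared levels; elements missing from a
# row get a count of 0, so every row yields a vector of equal length.
per_row <- vapply(parts,
                  function(x) as.integer(table(factor(x, levels = all_levels))),
                  integer(length(all_levels)))
rownames(per_row) <- all_levels

# per_row has one column per input row; summing across columns
# gives the total count of each element.
totals <- rowSums(per_row)
totals
```

The per_row matrix is also a handy intermediate result if you want to see which row contributed which element.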

Split strings in a matrix column and count the single elements in a new vector

3 answers: