Split strings in a matrix column and count the individual elements in a new vector

Time: 2014-05-13 22:18:06

Tags: r

I have a data set like this, read into R from a file with read.table():

Nr Result1
 1 "A203,A305,A409,B309,B424,B545"
 2 "A190,A203,A305,B309,B425,B545"
 3 "A203,A305,A410,B280,B309,B425,B545"

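Result1 is a string that I would like to split at "," in each row, counting how often each distinct element occurs overall. The result should be written out in this format: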

A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
1    3    3    1    1    1    3    1    2    3

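My first idea was to loop over the rows, split each string into its individual elements, and build a vector of counts from the elements of the first row; for each following row, check whether a matching element already exists (count + 1) or append the unknown element with count = 1.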

I am new to R and would appreciate some sample code or hints on how to implement the individual steps with R functions! Thanks a lot.

3 answers:

Answer 0 (score: 6)

I think this is what you are looking for:

newvector <- table(unlist(strsplit(as.character(df$Result1), ",")))

The result (stored in newvector):

#>newvector
#A190 A203 A305 A409 A410 B280 B309 B424 B425 B545 
#   1    3    3    1    1    1    3    1    2    3 

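The strsplit function splits the character vector (Result1) at each comma. The result is a list with one element per row of the data.frame (df in my example). To turn this list into a vector, use unlist. The table function then builds a table of frequencies.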
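
The three steps can also be run one at a time to inspect the intermediate results. This is only a minimal sketch: the data frame is rebuilt from the question's sample rows, and the names parts, all_elements and counts are illustrative, not part of the original answer.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps Result1 as character
df <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Step 1: split each row at the commas; gives a list with one character vector per row
parts <- strsplit(df$Result1, ",", fixed = TRUE)

# Step 2: flatten the list into a single character vector holding all elements
all_elements <- unlist(parts)

# Step 3: count how often each distinct element occurs
newvector <- table(all_elements)

# Optional: convert the table to a plain named integer vector
counts <- setNames(as.vector(newvector), names(newvector))
print(counts)
```

If the column was read in as a factor, wrapping it in as.character() first, as in the one-liner above, should give the same result.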

Answer 1 (score: 1)

Here is another alternative to consider:

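The development version of my "splitstackshape" package has a function called concat.split.expanded, which "expands" the values into a binary representation at their relevant column positions. You can then drop the first column and take the column sums (with colSums).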

Here is some sample data:

mydf <- data.frame(Nr = 1:3,
                   Result1 = c("A190,A203,A305,B309,B425,B545", 
                               "A203,A305,A409,B309,B424,B545", 
                               "A203,A305,A410,B280,B309,B425,B545"))

Here is the implementation I would suggest:

## library(devtools)
## install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape) ## Install "devel" version from github
colSums(concat.split.expanded(mydf, "Result1", ",", type="character", 
                              fill=0, drop = TRUE)[-1])
# Result1_A190 Result1_A203 Result1_A305 Result1_A409 Result1_A410 Result1_B280 
#            1            3            3            1            1            1 
# Result1_B309 Result1_B424 Result1_B425 Result1_B545 
#            3            1            2            3 
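
For readers without the package, the same "expand to a binary matrix, then sum the columns" idea can be approximated in base R. This is a rough sketch and not how concat.split.expanded is implemented; the names parts, lev, indicator and element_counts are made up for the illustration.

```r
# Same sample data as above, kept as character
mydf <- data.frame(
  Nr = 1:3,
  Result1 = c("A190,A203,A305,B309,B425,B545",
              "A203,A305,A409,B309,B424,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Split every row and collect the sorted set of distinct elements
parts <- strsplit(mydf$Result1, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))

# One row per input row, one column per distinct element:
# 1 if the element occurs in that row, 0 otherwise
indicator <- t(vapply(parts,
                      function(p) as.integer(lev %in% p),
                      integer(length(lev))))
colnames(indicator) <- lev

# Column sums give the number of rows containing each element
element_counts <- colSums(indicator)
print(element_counts)
```

Note that this counts presence per row, so an element repeated within a single row would be counted once here, whereas table would count every occurrence. For the sample data the two agree.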

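In the long run, this is supposed to be faster than table, although the timings as shown below have the table version (fun2) ahead:
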
mydf <- do.call(rbind, replicate(30000, mydf, FALSE))
dim(mydf)
# [1] 90000     2

fun1 <- function() {
  colSums(concat.split.expanded(mydf, "Result1", ",", 
                                type="character", fill=0, drop = TRUE)[-1])
} 
fun2 <- function() table(unlist(strsplit(as.character(mydf$Result1), ",")))
system.time(out1 <- fun1())
#    user  system elapsed 
#    6.10    0.00    6.09 
system.time(out2 <- fun2())
#    user  system elapsed 
#    0.77    0.00    0.76 
all.equal(as.vector(out1), as.vector(out2))
# [1] TRUE

Answer 2 (score: 0)

Use table(unlist(strsplit(yourstring, ","))). Example: if your string is example = "A203,A305,A409,B309,B424,B545", the code below produces:

R:> example = "A203,A305,A409,B309,B424,B545"
R:> table(unlist(strsplit(example, ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1 

A more complete example:

R:> ex <- data.frame(rbind(c(1, "A203,A305,A409,B309,B424,B545"), 
                           c(2, "A190,A203,A305,B309,B425,B545"),  
                           c(3, "A203,A305,A410,B280,B309,B425,B545")))
R:> ex
  X1                                 X2
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> names(ex) <- c("Nr", "Result1")
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545

R:> typeof(ex$Nr)
[1] "integer"
R:> typeof(ex$Result1)
[1] "integer"
R:> ex$Result1 <- as.character(ex$Result1)
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> table(unlist(strsplit(ex$Result1[1], ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1 

Use sapply() to process the whole column (Result1) at once. See below.

R:> sapply(ex$Result1, function(x) {table(unlist(strsplit(x, ",")))})
$`A203,A305,A409,B309,B424,B545`

A203 A305 A409 B309 B424 B545 
  1    1    1    1    1    1 

$`A190,A203,A305,B309,B425,B545`

A190 A203 A305 B309 B425 B545 
  1    1    1    1    1    1 

$`A203,A305,A410,B280,B309,B425,B545`

A203 A305 A410 B280 B309 B425 B545 
  1    1    1    1    1    1    1
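
The sapply() call above returns one table per row, while the question asks for totals across all rows. One possible way to get both from the same data is a two-way table of row number against element, sketched below. The data frame is rebuilt directly as character, and the names parts, per_row and totals are illustrative; lengths() requires R 3.2.0 or later.

```r
# Sample data, built directly with named columns and kept as character
ex <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Split each row at the commas
parts <- strsplit(ex$Result1, ",", fixed = TRUE)

# Repeat each row number once per element of that row, then cross-tabulate
per_row <- table(Nr = rep(ex$Nr, lengths(parts)),
                 element = unlist(parts))
print(per_row)

# Summing over the rows gives the overall counts asked for in the question
totals <- colSums(per_row)
print(totals)
```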