I have a dataset like this, from a file read into R with read.table():
Nr Result1
1 "A203,A305,A409,B309,B424,B545"
2 "A190,A203,A305,B309,B425,B545"
3 "A203,A305,A410,B280,B309,B425,B545"
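Result1 is a string. I want to split it at "," and count how many times each element occurs across all rows. I want to count the distinct elements and write out the result in this format: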
A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
1 3 3 1 1 1 3 1 2 3
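My first idea was to loop over the rows, split each string into its individual elements, and build a vector from the first row's elements. For each following row I would check whether an element already exists (count + 1) or append the unknown element with count = 1.
I am new to R and would appreciate some sample code, or hints on which R functions to use for the individual steps. Thanks a lot!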
Answer 0 (score: 6)
I think this is what you are looking for:
newvector <- table(unlist(strsplit(as.character(df$Result1), ",")))
The result (stored in newvector):
#>newvector
#A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
# 1 3 3 1 1 1 3 1 2 3
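The strsplit function splits the character vector (Result1) at each comma. The result is a list with one element per row of the data.frame (df in my example). To turn this list into a vector, use unlist. The table function then builds a table of the frequencies.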
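The one-liner can also be unpacked step by step, which may help when inspecting the intermediate results. This is only a sketch: the data frame df is rebuilt here from the rows shown in the question, and the names parts, elements and counts are made up for illustration.

```r
# Assumed sample data mirroring the question; `df` stands in for the
# result of read.table()
df <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Step 1: split each string at the commas; gives a list with one
# character vector per row (as.character() guards against factor columns)
parts <- strsplit(as.character(df$Result1), ",", fixed = TRUE)

# Step 2: flatten the list into a single character vector
elements <- unlist(parts)

# Step 3: count how often each distinct element occurs
counts <- table(elements)
counts
```

Printing parts after step 1 shows why unlist is needed: table is meant to count values in a flat vector, not in a list.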
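Answer 1 (score: 1)
Here is another alternative to consider:
The development version of my "splitstackshape" package has a function named concat.split.expanded, which "expands" the values into a binary representation at their relevant column positions. You can then drop the first column and take the column sums (using colSums).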
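To show what such a binary "expanded" layout looks like, here is a rough base R sketch. It is not the package's implementation, only an illustration of the idea; the names parts, lev and indicator are made up, and the sample data is an assumption modelled on the question.

```r
# Assumed sample data in the same shape as the question
mydf <- data.frame(
  Nr = 1:3,
  Result1 = c("A190,A203,A305,B309,B425,B545",
              "A203,A305,A409,B309,B424,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

# Split each row and collect the distinct elements in sorted order
parts <- strsplit(mydf$Result1, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))

# One row per input row, one column per distinct element:
# 1 = element present in that row, 0 = absent
indicator <- t(vapply(parts,
                      function(p) as.integer(lev %in% p),
                      integer(length(lev))))
colnames(indicator) <- lev

indicator
colSums(indicator)
```

Note that this marks presence per row, so the column sums equal the total counts only when no element is repeated within a single row, which holds for the data shown here.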
Here is some sample data:
mydf <- data.frame(Nr = 1:3,
Result1 = c("A190,A203,A305,B309,B425,B545",
"A203,A305,A409,B309,B424,B545",
"A203,A305,A410,B280,B309,B425,B545"))
Here is the implementation I suggest:
## library(devtools)
## install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape) ## Install "devel" version from github
colSums(concat.split.expanded(mydf, "Result1", ",", type="character",
fill=0, drop = TRUE)[-1])
# Result1_A190 Result1_A203 Result1_A305 Result1_A409 Result1_A410 Result1_B280
# 1 3 3 1 1 1
# Result1_B309 Result1_B424 Result1_B425 Result1_B545
# 3 1 2 3
On larger data, going by the timings below, this actually turns out slower than table:
mydf <- do.call(rbind, replicate(30000, mydf, FALSE))
dim(mydf)
# [1] 90000 2
fun1 <- function() {
colSums(concat.split.expanded(mydf, "Result1", ",",
type="character", fill=0, drop = TRUE)[-1])
}
fun2 <- function() table(unlist(strsplit(as.character(mydf$Result1), ",")))
system.time(out1 <- fun1())
# user system elapsed
# 6.10 0.00 6.09
system.time(out2 <- fun2())
# user system elapsed
# 0.77 0.00 0.76
all.equal(as.vector(out1), as.vector(out2))
# [1] TRUE
Answer 2 (score: 0)
Use table(unlist(strsplit(yourstring, ","))). Example: if your string is example = "A203,A305,A409,B309,B424,B545", the code below produces:
R:> example = "A203,A305,A409,B309,B424,B545"
R:> table(unlist(strsplit(example, ",")))
A203 A305 A409 B309 B424 B545
1 1 1 1 1 1
A more complete example:
R:> ex <- data.frame(rbind(c(1, "A203,A305,A409,B309,B424,B545"),
c(2, "A190,A203,A305,B309,B425,B545"),
c(3, "A203,A305,A410,B280,B309,B425,B545")))
R:> ex
X1 X2
1 1 A203,A305,A409,B309,B424,B545
2 2 A190,A203,A305,B309,B425,B545
3 3 A203,A305,A410,B280,B309,B425,B545
R:> names(ex) <- c("Nr", "Result1")
R:> ex
Nr Result1
1 1 A203,A305,A409,B309,B424,B545
2 2 A190,A203,A305,B309,B425,B545
3 3 A203,A305,A410,B280,B309,B425,B545
R:> typeof(ex$Nr)
[1] "integer"
R:> typeof(ex$Result1)
[1] "integer"
R:> ex$Result1 <- as.character(ex$Result1)
R:> ex
Nr Result1
1 1 A203,A305,A409,B309,B424,B545
2 2 A190,A203,A305,B309,B425,B545
3 3 A203,A305,A410,B280,B309,B425,B545
R:> table(unlist(strsplit(ex$Result1[1], ",")))
A203 A305 A409 B309 B424 B545
1 1 1 1 1 1
Use sapply() to process the whole column (Result1) in one go. See below.
R:> sapply(ex$Result1, function(x) {table(unlist(strsplit(x, ",")))})
$`A203,A305,A409,B309,B424,B545`
A203 A305 A409 B309 B424 B545
1 1 1 1 1 1
$`A190,A203,A305,B309,B425,B545`
A190 A203 A305 B309 B425 B545
1 1 1 1 1 1
$`A203,A305,A410,B280,B309,B425,B545`
A203 A305 A410 B280 B309 B425 B545
1 1 1 1 1 1 1
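The sapply() call returns one table per row, not the overall totals the question asks for. One possible way to get both at once is a two-way table of row number against element. This is a sketch only; the data frame is rebuilt here, and parts, long and per_row are illustrative names.

```r
# Assumed sample data, same rows as in the example above
ex <- data.frame(
  Nr = 1:3,
  Result1 = c("A203,A305,A409,B309,B424,B545",
              "A190,A203,A305,B309,B425,B545",
              "A203,A305,A410,B280,B309,B425,B545"),
  stringsAsFactors = FALSE
)

parts <- strsplit(ex$Result1, ",", fixed = TRUE)

# Long format: one line per (row number, element) pair
long <- data.frame(Nr = rep(ex$Nr, lengths(parts)),
                   element = unlist(parts))

# Rows of the table are the original rows, columns are the distinct elements
per_row <- table(long$Nr, long$element)
per_row

# Column sums give the totals across all rows
colSums(per_row)
```

Because every row shares the same set of columns, missing elements show up as 0 rather than being left out.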