My data looks like this:
df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L,
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport",
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis",
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport",
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
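I want to count how many times each string occurs, and also keep track of which rows it came from. The strings are separated by ; but each one belongs to the row it sits in.
I would like the output to look like this: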
String               Count  position
mRNA                 1      1
stimulus             3      1,6,11
transport            4      1,5,9,11
MAPK cascade         2      2,5
cell and biogenesis  3      2,5,9
targeting            2      3,4
regulation           2      6,11
differentiation      1      6
metabolic process    2      6,11
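Count shows how many times each string (strings are separated by semicolons) occurs across the whole data. The position column shows the rows where it occurs. For example, mRNA appears only in the first row, so its position is 1; stimulus appears in rows 1, 6 and 11.
Some rows are blank, and they still count as rows.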
Answer 0 (score: 4)
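In the code below, we do the following:
1. Add the row numbers as a column with rownames_to_column.
2. Use strsplit to split each string into its components and store the result in a column called string.
3. strsplit returns a list. We use unnest to stack the list components into a "long" data frame, which gives us a "tidy" data frame that is ready to summarise.
4. Group by string and return a new data frame that counts the frequency of each string and gives the original row numbers in which each instance of the string appeared.
library(tidyverse)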
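# strsplit() needs character input, so convert the factor column first
df$V1 = as.character(df$V1)
df %>%
  rownames_to_column() %>%                 # row numbers become a "rowname" column
  mutate(string = strsplit(V1, ";")) %>%   # list column: one character vector per row
  unnest(string) %>%                       # one row per (rowname, string) pair
  group_by(string) %>%
  summarise(count = n(),
            rows = paste(rowname, collapse=","))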
  string              count rows
1 cell and biogenesis     3 2,5,9
2 differentiation         1 6
3 MAPK cascade            2 2,5
4 metabolic process       2 6,11
5 mRNA                    1 1
6 regulation              2 6,11
7 stimulus                3 1,6,11
8 targeting               2 3,4
9 transport               4 1,5,9,11
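To make the strsplit and unnest steps above more concrete, here is a minimal base R sketch of what happens to a blank row. The vector x is a hypothetical three-row input, not the question's data, and the rep()/unlist() combination is only a rough base R stand-in for unnest, not what tidyr does internally.

```r
# Hypothetical input: two populated rows and one blank row
x <- c("mRNA;stimulus", "", "targeting")

# strsplit() returns a list with one character vector per input row
parts <- strsplit(x, ";", fixed = TRUE)

# Number of pieces per row; the blank row yields character(0), i.e. zero pieces
lengths(parts)

# Stack the list into a long data frame, repeating each row number once
# per piece. The blank row contributes nothing, so it drops out here.
long <- data.frame(row = rep(seq_along(parts), lengths(parts)),
                   string = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

This is why blank rows still count toward the row numbering but never show up as a string in the summary.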
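If you plan to do further processing on the row numbers, you may want to keep them as numeric values rather than as a string of pasted values. In that case, you can do this: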
df.new = df %>%
  rownames_to_column("rows") %>%
  mutate(rows = as.integer(rows),          # rownames_to_column() yields character
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)
This gives you a long data frame with one row for each combination of string and rows.
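As an illustration of the kind of further processing a long data frame allows, the sketch below computes the first and last row for each string. The toy data frame and the first_row/last_row columns are hypothetical, and the sketch assumes the row numbers have been converted with as.integer, since rownames_to_column returns them as character.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the question's data
toy <- data.frame(V1 = c("a;b", "", "b;c"), stringsAsFactors = FALSE)

long <- toy %>%
  rownames_to_column("rows") %>%
  mutate(rows = as.integer(rows),          # numeric row numbers
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)                           # the blank row drops out here

# With numeric rows, ordinary numeric summaries work directly
long %>%
  group_by(string) %>%
  summarise(count = n(),
            first_row = min(rows),
            last_row = max(rows))
```

Keeping the rows column numeric also makes it straightforward to filter, for example with filter(rows > 5), which a pasted string would not allow.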
Answer 1 (score: 3)
A base R approach:
# convert 'V1' to a character vector (only necessary if it isn't already)
df$V1 <- as.character(df$V1)
# get the unique strings
strng <- unique(unlist(strsplit(df$V1,';')))
# create a list with the rows for each unique string
lst <- lapply(strng, function(x) grep(x, df$V1, fixed = TRUE))
# get the counts for each string
count <- lengths(lst)
# collapse the row positions for each string into a single comma-separated string
pos <- sapply(lst, toString)
# put everything together in one dataframe
d <- data.frame(strng, count, pos)
You can shorten this approach to:
d <- data.frame(strng = unique(unlist(strsplit(df$V1,';'))))
lst <- lapply(d$strng, function(x) grep(x, df$V1, fixed = TRUE))
transform(d, count = lengths(lst), pos = sapply(lst, toString))
Result:
> d
strng count pos
1 mRNA 1 1
2 stimulus 3 1, 6, 11
3 transport 4 1, 5, 9, 11
4 MAPK cascade 2 2, 5
5 cell and biogenesis 3 2, 5, 9
6 targeting 2 3, 4
7 differentiation 1 6
8 metabolic process 2 6, 11
9 regulation 2 6, 11
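One thing to keep in mind with the grep approach above: grep(x, ..., fixed = TRUE) matches substrings, so a term that is contained in a longer term would be counted in that term's rows as well. That does not happen with the data in the question, but the sketch below shows the difference on a hypothetical data frame (toy) and one possible exact-match variant. It is an alternative under stated assumptions, not part of the original answer.

```r
# Hypothetical data in which one term is a substring of another
toy <- data.frame(V1 = c("regulation of mRNA stability;transport",
                         "regulation;transport",
                         ""),
                  stringsAsFactors = FALSE)

parts <- strsplit(toy$V1, ";", fixed = TRUE)
strng <- unique(unlist(parts))

# Substring matching: "regulation" is also found inside row 1
sub_rows <- lapply(strng, function(x) grep(x, toy$V1, fixed = TRUE))

# Exact matching: a row counts only if one of its split pieces equals the term
exact_rows <- lapply(strng, function(x) {
  which(vapply(parts, function(p) x %in% p, logical(1)))
})

data.frame(strng,
           count = lengths(exact_rows),
           pos = sapply(exact_rows, toString))
```

Here the substring version reports "regulation" in rows 1 and 2, while the exact-match version reports it in row 2 only.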
Answer 2 (score: 1)
A possible data.table solution, for completeness:
library(data.table)
setDT(df)[, .(.I, unlist(tstrsplit(V1, ";", fixed = TRUE)))
][!is.na(V2), .(count = .N, pos = toString(sort(I))),
by = .(String = V2)]
# String count pos
# 1: mRNA 1 1
# 2: MAPK cascade 2 2, 5
# 3: targeting 2 3, 4
# 4: differentiation 1 6
# 5: cell and biogenesis 3 2, 5, 9
# 6: metabolic process 2 6, 11
# 7: stimulus 3 1, 6, 11
# 8: transport 4 1, 5, 9, 11
# 9: regulation 2 6, 11
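This basically splits the V1 column on ; while converting it to long format and binding it to the row index (.I). After that, it is just a simple aggregation by String: the row count (.N), plus the bound positions collapsed into a single string.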