I have a large data set, and I want to find the row numbers at which each identical string occurs.
df<- structure(list(x = structure(c(5L, 5L, 5L, 5L, 1L, 1L, 3L, 5L,
5L, 6L, 6L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 3L), .Label = c("AJ5ter2",
"al-1Tter2", "AY9ter2", "CY-Yter2", "LK2ter2", "YY49ter2"), class = "factor")), class = "data.frame", row.names = c(NA,
-19L))
The expected output looks like this:
LK2ter2 1:4, 8:9
AJ5ter2 5:6
AY9ter2 7, 19
YY49ter2 10:11
al-1Tter2 12:15
CY-Yter2 16:18
Answer 0 (score: 3)
Another option, using data.table:
library(data.table)
DT <- as.data.table(df)
DT[, .(index = paste(unique(range(.I)), collapse = ":")), by = .(x, rleid(x))
][, .(index = toString(index)), by = x]
# x index
#1: LK2ter2 1:4, 8:9
#2: AJ5ter2 5:6
#3: AY9ter2 7, 19
#4: YY49ter2 10:11
#5: al-1Tter2 12:15
#6: CY-Yter2 16:18
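To see why the call above yields `7` rather than `7:7` for a single-row run, here is a small sketch on a hypothetical toy vector `v` (not part of the question). It only illustrates the two building blocks: `rleid()` numbers consecutive runs, and `unique(range(...))` collapses a one-row run to a single value.

```r
library(data.table)

v <- c("a", "a", "b", "a")   # hypothetical toy input, not the question's data
rleid(v)                     # run ids: 1 1 2 3

# A run covering rows 1 and 2 keeps both endpoints
paste(unique(range(c(1L, 2L))), collapse = ":")   # "1:2"

# A single-row run has identical endpoints, so unique() leaves one value
paste(unique(range(3L)), collapse = ":")          # "3"
```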
Answer 1 (score: 2)
You can try the following:
# simplify = FALSE keeps z a named list even if every group has the same size
z <- sapply(levels(df$x), function(x) which(x == df$x), simplify = FALSE)
data.frame(key = names(z), index = sapply(z, paste, collapse = ", "), row.names = NULL)
key index
1 AJ5ter2 5, 6
2 al-1Tter2 12, 13, 14, 15
3 AY9ter2 7, 19
4 CY-Yter2 16, 17, 18
5 LK2ter2 1, 2, 3, 4, 8, 9
6 YY49ter2 10, 11
Answer 2 (score: 2)
Here is one approach using dplyr. I'm not sure whether you want the output as text or as a numeric vector.
library(tidyverse)
df <- structure(list(x = structure(c(5L, 5L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 3L), .Label = c("AJ5ter2", "al-1Tter2", "AY9ter2", "CY-Yter2", "LK2ter2", "YY49ter2"), class = "factor")), class = "data.frame", row.names = c(NA, -19L))
df %>%
  mutate(row_number = row_number()) %>%
  group_by(x) %>%
  summarise(row_nums = str_c(row_number, collapse = ","))
#> # A tibble: 6 x 2
#> x row_nums
#> <fct> <chr>
#> 1 AJ5ter2 5,6
#> 2 al-1Tter2 12,13,14,15
#> 3 AY9ter2 7,19
#> 4 CY-Yter2 16,17,18
#> 5 LK2ter2 1,2,3,4,8,9
#> 6 YY49ter2 10,11
Created on 2019-02-19 by the reprex package (v0.2.1)
Answer 3 (score: 2)
Using tidyverse and data.table, you can do the following:
library(tidyverse)
library(data.table)  # for rleid()
df %>%
  rowid_to_column() %>%
  group_by(x, rleid(x)) %>%
  summarise(res = ifelse(min(rowid) != max(rowid),
                         paste(min(rowid), max(rowid), sep = ":"), paste(rowid))) %>%
  group_by(x) %>%
  summarise(res = paste(res, collapse = ", "))
x res
<fct> <chr>
1 AJ5ter2 5:6
2 al-1Tter2 12:15
3 AY9ter2 7, 19
4 CY-Yter2 16:18
5 LK2ter2 1:4, 8:9
6 YY49ter2 10:11
Or the same with tidyverse alone:
df %>%
  rowid_to_column() %>%
  group_by(x, x_rleid = {
    runs <- rle(as.numeric(x))
    rep(seq_along(runs$lengths), runs$lengths)
  }) %>%
  summarise(res = ifelse(min(rowid) != max(rowid),
                         paste(min(rowid), max(rowid), sep = ":"), paste(rowid))) %>%
  group_by(x) %>%
  summarise(res = paste(res, collapse = ", "))
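Both snippets first add a column holding the row ID. Second, they group by "x" and by the run-length group ID of "x", so each consecutive run of the same value forms its own group. Third, they check whether the minimum row ID differs from the maximum row ID. If it does, the minimum and maximum are combined, separated by ":"; otherwise the single row ID is used on its own. Finally, they group by "x" alone and join the pieces with ", ".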
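The same steps can be sketched in base R. This is only an illustration under stated assumptions: the helper name `row_ranges` is hypothetical, the input is rebuilt with `rep()` to match the question's data, and the start and end of each run are taken from `cumsum()` over the `rle()` lengths instead of an explicit run ID column.

```r
# Rebuild the question's column as a plain character vector
x <- rep(c("LK2ter2", "AJ5ter2", "AY9ter2", "LK2ter2",
           "YY49ter2", "al-1Tter2", "CY-Yter2", "AY9ter2"),
         times = c(4, 2, 1, 2, 2, 4, 3, 1))

# Hypothetical helper: one "first:last" label per run, joined per value
row_ranges <- function(x) {
  runs  <- rle(as.character(x))          # consecutive runs of equal values
  last  <- cumsum(runs$lengths)          # last row of each run
  first <- last - runs$lengths + 1L      # first row of each run
  # Use "first:last" for multi-row runs, a single number otherwise
  label <- ifelse(first != last,
                  paste(first, last, sep = ":"),
                  as.character(first))
  # Join the labels of all runs that share the same value
  tapply(label, runs$values, paste, collapse = ", ")
}

res <- row_ranges(x)
res[["LK2ter2"]]   # "1:4, 8:9"
res[["AY9ter2"]]   # "7, 19"
```

The result is a named character array with one entry per distinct value, sorted by name like the tibble output above.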
If you need all the values rather than just the ranges:
df %>%
  rowid_to_column() %>%
  group_by(x, x_rleid = {
    runs <- rle(as.numeric(x))
    rep(seq_along(runs$lengths), runs$lengths)
  }) %>%
  summarise(res = paste(rowid, collapse = ",")) %>%
  group_by(x) %>%
  summarise(res = paste(res, collapse = ","))
x res
<fct> <chr>
1 AJ5ter2 5,6
2 al-1Tter2 12,13,14,15
3 AY9ter2 7,19
4 CY-Yter2 16,17,18
5 LK2ter2 1,2,3,4,8,9
6 YY49ter2 10,11