Question

I want to find the uppercase letters in each string and count how many times each one occurs in that string. For example:

library(plyr)     # ldply
library(stringr)  # str_match_all

t = c("gctaggggggatggttactactGtgctatggactac", "gGaagggacggttactaCgTtatggactacT", "gcGaggggattggcttacG")

ldply(str_match_all(t,"[A-Z]"),length)

When I apply the function above, my output is

1 4 2

But my desired output is

[1] G -1

[2] G -1          C -1          T -2

[3] G -2

Answer 1

You can extract all the uppercase letters and then use table to count the frequencies:

library(stringr)
lapply(str_extract_all(t, "[A-Z]"), table)
# [[1]]
# 
# G 
# 1 
# 
# [[2]]
# 
# C G T 
# 1 1 2 
# 
# [[3]]
# 
# G 
# 2
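If stringr is not available, the same extract-then-tabulate idea can be sketched in base R. This is only an illustration under a few assumptions: the variable names `strings`, `caps`, and `counts` are mine, not from the answer, and `perl = TRUE` is used so that `[A-Z]` is treated as the plain ASCII range.

```r
# Sample strings from the question (named `strings` here to avoid masking base::t)
strings <- c("gctaggggggatggttactactGtgctatggactac",
             "gGaagggacggttactaCgTtatggactacT",
             "gcGaggggattggcttacG")

# gregexpr() locates every uppercase letter; regmatches() extracts the matched text,
# giving one character vector per input string
caps <- regmatches(strings, gregexpr("[A-Z]", strings, perl = TRUE))

# One frequency table per input string
counts <- lapply(caps, table)
counts
```

A string with no uppercase letters should simply yield an empty table rather than an error.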

Answer 2

If you want to extend docendo's answer into the exact format you asked for:

lapply(stringr::str_extract_all(t, "[A-Z]"), 
       function(x) {
         x = table(x)
         paste(names(x), x, sep = "-")
       })

# [[1]]
# [1] "G-1"
# 
# [[2]]
# [1] "C-1" "G-1" "T-2"
# 
# [[3]]
# [1] "G-2"
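The desired output in the question lists the letters in order of first appearance (G, C, T) and puts them on one line per string, whereas table sorts the letters alphabetically. A possible variation, sketched here in base R so that it is self-contained, is shown below. The helper name `format_caps` and the exact separators are my own assumptions, chosen to mimic the question's layout.

```r
strings <- c("gctaggggggatggttactactGtgctatggactac",
             "gGaagggacggttactaCgTtatggactacT",
             "gcGaggggattggcttacG")

# Hypothetical helper: takes the uppercase letters found in one string and
# returns a single "LETTER -COUNT" string, keeping order of first appearance
format_caps <- function(x) {
  # Setting the factor levels to unique(x) stops table() from sorting alphabetically
  counts <- table(factor(x, levels = unique(x)))
  paste(names(counts), counts, sep = " -", collapse = "  ")
}

# Extract the uppercase letters of each string, then format them
caps <- regmatches(strings, gregexpr("[A-Z]", strings, perl = TRUE))
formatted <- vapply(caps, format_caps, character(1))
formatted
# [1] "G -1"             "G -1  C -1  T -2" "G -2"
```

Using vapply with character(1) keeps the result a plain character vector, one element per input string.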

And here is how I would do it in the tidyverse:

library(tidyverse)
data = data.frame(strings = c("gctaggggggatggttactactGtgctatggactac", "gGaagggacggttactaCgTtatggactacT", "gcGaggggattggcttacG"))
data  %>%
  mutate(caps_freq = stringr::str_extract_all(strings, "[A-Z]"),
         caps_freq = map(caps_freq, function(letter) data.frame(table(letter)))) %>%
  unnest(caps_freq)  # name the list-column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
#                                strings letter Freq
# 1 gctaggggggatggttactactGtgctatggactac      G    1
# 2      gGaagggacggttactaCgTtatggactacT      C    1
# 3      gGaagggacggttactaCgTtatggactacT      G    1
# 4      gGaagggacggttactaCgTtatggactacT      T    2
# 5                  gcGaggggattggcttacG      G    2

Find the uppercase letters in a string
