Count occurrences, counting order-swapped strings only once

Time: 2019-04-17 17:37:14

Tags: r match

Revised question with a more accurate example dataset

I have several different lists, each containing many terms. Here is a very short example:

List1 <- "A + B + C + D + E:F + F:E"

List2<- "A + B + C + E:F + F:E + G:H + H:G"

List3 <- "J + K + L + L:H + L:H1"

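I am trying to find how often each term occurs across all of these lists, but some terms are repeated in swapped order, and that causes problems.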

Using a lot of loops, and then X %in% Y on the split pieces (split before and after the ":"), I get

 sig_var8
     var count
 1     0     0
 2     A     2
 3     B     2
 4     C     2
 5     D     1
 6   E:F     2
 7   F:E     2
 8   G:H     1
 9   H:G     1
 10    J     1
 11    K     1
 12    L     1
 13  L:H     1
 14 L:H1     1

What I want is

sig_var8
     var count
 1     0     0
 2     A     2
 3     B     2
 4     C     2
 5     D     1
 6   E:F     2
 7   G:H     1
 8     J     1
 9     K     1
 10    L     1
 11  L:H     1
 12 L:H1     1

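Note: in List1, E:F and F:E are considered the same term and should be counted only once. The same applies to List2, where G:H == H:G and is counted only once. Also note that grep is not a good fit here, because L:H and L:H1 in List3 are different terms and must be counted separately (hence %in%).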
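To make the grep versus %in% point concrete, here is a small sketch using the terms of List3 from the question. The variable names (`terms`, `grep_hits`, `exact_hits`) are illustrative only and do not appear in the original post.

```r
# Terms of List3 from the question, already split into a character vector
terms <- c("J", "K", "L", "L:H", "L:H1")

# Substring matching: "L:H" is contained in "L:H1", so both are hit
grep_hits <- grepl("L:H", terms, fixed = TRUE)
terms[grep_hits]

# Exact matching: only the identical string is hit
exact_hits <- terms %in% "L:H"
terms[exact_hits]
```

With substring matching, L:H1 would be counted as an occurrence of L:H, which is why whole-string comparison is the safer choice here.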

Here is my working code:

sig_var8<-data.frame(matrix(data=0,nrow=1,ncol=2))
colnames(sig_var8)<-c("var","count")
sig_var8[,1]<-as.character(sig_var8[,1])
sig_var8[,2]<-as.numeric(sig_var8[,2])


for(list in 1:3){
  temp_list<-get(paste0("List",list)) #get the equation above
  assign(paste0("List",list,"a"), gsub(" ","",temp_list)) #remove all spaces in the sentence
  assign(paste0("List",list,"a_split"), strsplit(get(paste0("List",list,"a")),"[+]")) #split where "+" are
  temp_listA<-get(paste0("List",list,"a_split"))[[1]]
  for (item in 1:length(temp_listA)){
    if(isTRUE(temp_listA[item] %in% sig_var8[,1])){
      row_n<-which(sig_var8[,1]==temp_listA[item])
      sig_var8[row_n,2]<-sig_var8[row_n,2]+1
     }
     if(isFALSE(temp_listA[item] %in% sig_var8[,1])){
       row_n<-nrow(sig_var8)
       sig_var8[row_n+1,1]<-temp_listA[item]
       sig_var8[row_n+1,2]<-1
    }
  }
 }
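The splitting step in the loop above can probably be done without get/assign. The following is a minimal sketch, not the asker's method: `split_terms` and `term_list` are hypothetical names introduced here, and it assumes every string uses "+" as the only separator between terms.

```r
# The three formula-like strings from the question
List1 <- "A + B + C + D + E:F + F:E"
List2 <- "A + B + C + E:F + F:E + G:H + H:G"
List3 <- "J + K + L + L:H + L:H1"

# Hypothetical helper: drop all spaces, then split on a literal "+"
split_terms <- function(x) {
  strsplit(gsub(" ", "", x, fixed = TRUE), "+", fixed = TRUE)[[1]]
}

# One character vector of terms per input string
term_list <- lapply(list(List1, List2, List3), split_terms)
term_list[[3]]
```

Keeping the results in one list avoids creating objects such as List1a in the workspace, and it yields the character-vector form that the answers work with.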

3 answers:

Answer 0 (score: 3)

Maybe something like the following does what you need.

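# Assumes List1..List3 are character vectors of terms,
# e.g. c("A", "B", "C", "D", "E:F", "F:E"), not single "A + B + ..." strings
Lst <- mget(ls(pattern = "^List"))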

Lst <- lapply(Lst, function(x) {
  L <- strsplit(x, ":")
  res <- sapply(L, function(y){
    paste(sort(y), collapse = ":")
  })
  unique(res)
})

table(unlist(Lst))
#
#   A    B    C    D  E:F  G:H  H:L H1:L    J    K    L 
#   2    2    2    1    2    1    1    1    1    1    1 
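One thing to watch with `ls(pattern = "^List")`: it returns every object whose name starts with "List", including leftovers such as List1a that the question's loop creates with assign. A hedged sketch of the difference follows; the List1a value is only a stand-in for such a leftover.

```r
List1 <- c("A", "B", "C", "D", "E:F", "F:E")
List2 <- c("A", "B", "C", "E:F", "F:E", "G:H", "H:G")
List3 <- c("J", "K", "L", "L:H", "L:H1")
List1a <- "A+B+C+D+E:F+F:E"  # stand-in for a leftover from the question's loop

# Pattern lookup also picks up the leftover object
pattern_names <- ls(pattern = "^List")
pattern_names

# Explicit names fetch only the three intended vectors
Lst <- mget(paste0("List", 1:3))
names(Lst)
```

Note also that this answer sorts the parts of each term, so L:H and L:H1 are reported as H:L and H1:L in its output.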

Answer 1 (score: 1)

I'm not 100% sure this is what you are looking for, but if it is, I will add comments to it.

List1 <- c("A","B","C","D","E:F","F:E")
List2<- c("A","B","C","E:F","F:E","G:H","H:G")
List3 <- c("J","K","L","L:H","L:H1")

Lst <- list(List1, List2, List3)

keep_me <- lapply(Lst, function(x) !duplicated(lapply(strsplit(x, ":", fixed = T), sort)))
Lst_cleaned <- unlist(Map(`[`, Lst, keep_me))
table(Lst_cleaned)
Lst_cleaned
   A    B    C    D  E:F  G:H    J    K    L  L:H L:H1 
   2    2    2    1    2    1    1    1    1    1    1 

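Edit: added an explanation below. Let me know if anything is still unclear or if you run into more problems. I first use List1 to demonstrate what lapply does to each list element. As a side note, breaking it down also made me realize that which is not needed if you don't want to use it: you can subset the elements of Lst directly with the logical vectors in Map.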
# Splitting the string on the colon and sorting the elements
lapply(strsplit(List1, ":", fixed = T), sort)
[[1]]
[1] "A"

[[2]]
[1] "B"

[[3]]
[1] "C"

[[4]]
[1] "D"

[[5]]
[1] "E" "F"

[[6]]
[1] "E" "F"

# Logical vector marking the elements that are NOT duplicated
!duplicated(lapply(strsplit(List1, ":", fixed = T), sort))
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# which() gives the indices of the TRUEs
which(!duplicated(lapply(strsplit(List1, ":", fixed = T), sort)))
[1] 1 2 3 4 5

# Now, all together: lapply applies the above logic to
# each element in Lst and returns a list of the indices that are not
# duplicates for each vector
lapply(Lst, function(x) which(!duplicated(lapply(strsplit(x, ":", fixed = T), sort))))
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 6

[[3]]
[1] 1 2 3 4 5

keep_me <- lapply(Lst, function(x) which(!duplicated(lapply(strsplit(x, ":", fixed = T), sort))))

# Map subsets (`[`) Lst by the indices in keep_me, and unlist  
# flattens the list (i.e., unlist makes it a vector)
Map(`[`, Lst, keep_me)
[[1]]
[1] "A"   "B"   "C"   "D"   "E:F"

[[2]]
[1] "A"   "B"   "C"   "E:F" "G:H"

[[3]]
[1] "J"    "K"    "L"    "L:H"  "L:H1"

unlist(Map(`[`, Lst, keep_me))
 [1] "A"    "B"    "C"    "D"    "E:F"  "A"    "B"    "C"    "E:F"  "G:H"  "J"    "K"    "L"    "L:H"  "L:H1"
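The question asks for a data frame with var and count columns rather than a table. A possible last step, sketched under the assumption that the input is the three character vectors used in this answer, is to convert the table into that shape. The name `sig_var8` is taken from the question; `counts` is introduced here for illustration.

```r
Lst <- list(
  c("A", "B", "C", "D", "E:F", "F:E"),
  c("A", "B", "C", "E:F", "F:E", "G:H", "H:G"),
  c("J", "K", "L", "L:H", "L:H1")
)

# Logical vectors marking the first occurrence of each order-insensitive term
keep_me <- lapply(Lst, function(x) {
  !duplicated(lapply(strsplit(x, ":", fixed = TRUE), sort))
})

# Frequency of the kept terms across all vectors
counts <- table(unlist(Map(`[`, Lst, keep_me)))

# Reshape into the var/count layout used in the question
sig_var8 <- data.frame(
  var = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
sig_var8
```

Unlike the desired output shown in the question, this sketch has no leading "0 0" row, since that row only comes from how the question's data frame is initialized.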

Answer 2 (score: 1)

Building on @Rui's answer, I think this will do what you are asking for.

List1 <- c("A","B","C","D","E:F","F:E")
List2<- c("A","B","C","E:F","F:E","G:H","H:G")
List3 <- c("J","K","L","L:H","L:H1")

# make list of all objects starting with List
Lst <- mget(ls(pattern = "^List"))

# function to split, sort, and stitch the duplicates
split.sort <- function(x) {
  ifelse(length(x) > 1, paste0(sort(x), collapse = ":"), x)
}

# apply function to each of the Lst lists and remove duplicates
Lst <- lapply(Lst, function(y) unique(sapply(strsplit(y, ":"), split.sort)))

# get frequency
table(unlist(Lst))
#> 
#>    A    B    C    D  E:F  G:H  H:L H1:L    J    K    L 
#>    2    2    2    1    2    1    1    1    1    1    1

Created on 2019-04-17 by the reprex package (v0.2.1)