Shortest unique substring of a vector

Time: 2017-12-08 12:25:19

Tags: r string substring

Suppose I have a character vector of length 2,

vec <- c("man lives to work", "man works to live")

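Starting from the beginning of each string, I want to find the shortest unique substring (in complete words) in this vector. In other words, I am not looking for the shortest substring overall; I want to crop each string right after the point at which it becomes unique, which in this case is after lives and works, respectively.

So the result in this case should be:

[1] "man lives" "man works"

The strings should be cropped after lives / works, because that is the earliest point at which they become unique (in this context). Including to would be redundant, because the strings are already unique by then. Including only man would not be enough, because c("man", "man") is not unique.

(I want to use this to generate valid R names automatically; a standard name-cleaning step will do the rest.)

How can I do this?

I assume there is already a package that does this, but I could not find one.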

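2 answers:

Answer 0: (score: 2)

As a general strategy, I would a) check whether the first word is unique, b) if not, check whether the first two words are unique, and c) continue until a unique solution has been found for every string.

You can implement this with a while loop or with recursion. Here is an updated example (updated to preserve the original order):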

library(stringi) ## makes string processing easier

vec <- c("man lives to work", "man works to live")

(word.mat <- stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
##      [,1]  [,2]    [,3] [,4]  
## [1,] "man" "lives" "to" "work"
## [2,] "man" "works" "to" "live"

## function to extract unique words
unique_words <- function(x, # matrix of words
                         n = nrow(x), # number of original strings
                         nc=1 # number of columns (words) to use
                         ) {
    ## join the first nc words
    s <- stri_trim(apply(x[, 1:nc, drop = FALSE], 1, stri_join, collapse = " "))
    ## find non-duplicated word combinations, and store in column 1
    nodups <- !s %in% s[stri_duplicated(s)]
    x[nodups, 1] <- s[nodups]
    ## remove extra words from the matrix
    x[nodups, -1] <- ""
    ## if some strings are not unique, do it again, increasing nc by one
    if(any(x[, 2] != "")) {
        x <- unique_words(x = x, n = n, nc = nc + 1)
    ## otherwise, grab the unique sub-phrases from column 1    
    } else {
        x <- x[, 1]
    }
    ## return the result
    x
}    
## test it out
unique_words(word.mat)
## [1] "man lives" "man works"

## test it out with a more complicated example:
vec <- c("foo", "man lives to eat", "man eats to live",
         "woman lives to work", "woman works to live",
         "we like apples", "we like peaches",
         "they like plums", "they love peas", "bar")
unique_words(stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
## [1] "foo"             "man lives"       "man eats"        "woman lives"    
## [5] "woman works"     "we like apples"  "we like peaches" "they like"      
## [9] "they love"       "bar"
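
The answer above mentions a while loop as an alternative to recursion but only shows the recursive form. A minimal base R sketch of the loop variant might look like the following. The function name shortest_unique_prefix and its handling of exact duplicates are assumptions made for illustration, not part of the original answer, and it splits on whitespace with strsplit rather than using stringi word boundaries, so punctuation is treated differently.

```r
## Hypothetical helper (not from the original answer): shortest unique
## word prefixes, using a while loop and base R only.
shortest_unique_prefix <- function(vec) {
    if (length(vec) == 0) return(character(0))
    ## split each string into words on runs of whitespace
    words <- strsplit(trimws(vec), "[[:space:]]+")
    n_words <- lengths(words)
    ## first k words of every string, capped at the string's own length
    prefix <- function(k) {
        mapply(function(w, n) paste(w[seq_len(min(k, n))], collapse = " "),
               words, n_words, USE.NAMES = FALSE)
    }
    result <- character(length(vec))
    done <- rep(FALSE, length(vec))
    k <- 1
    while (!all(done) && k <= max(n_words)) {
        p <- prefix(k)
        ## a prefix counts as unique if no other string shares it
        uniq <- !(p %in% p[duplicated(p)])
        newly <- uniq & !done
        result[newly] <- p[newly]
        done <- done | newly
        k <- k + 1
    }
    ## assumption: exact duplicates never become unique, so keep them whole
    result[!done] <- prefix(max(n_words))[!done]
    result
}

shortest_unique_prefix(c("man lives to work", "man works to live"))
## [1] "man lives" "man works"
```

Unlike the recursive version, this sketch stops once every word has been used, so identical input strings are returned unchanged instead of triggering further recursion. Whether that is the desired behavior depends on the use case.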

Answer 1: (score: 1)

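library(dplyr)    ## pipes, mutate, group_by, summarise_each
library(tidytext) ## unnest_tokens

## assumption: df is not defined in the original answer; a one-column
## tibble holding the question's two strings matches the output below
df <- tibble(words = c("man lives to work", "man works to live"))

df %>%  unnest_tokens(word ,words) %>%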
  mutate(bigram = substr(word,1,2), 
         trigram = ifelse (nchar(word) >= 3,substr(word,1,3),NA) ,
         four_gram  = ifelse (nchar(word) >= 4, substr(word,1,4), NA), 
         five_gram  = ifelse (nchar(word) >= 5, substr(word,1,5), NA)) %>%
  group_by(bigram) %>%
  mutate(count_bigram = n()) %>%
  ungroup() %>%
  group_by(trigram) %>%
  mutate(count_trigram = n()) %>%
  ungroup() %>%
  group_by(four_gram) %>%
  mutate(count_four_gram = n()) %>%
  ungroup() %>%
  group_by(five_gram) %>%
  mutate(count_five_gram = n()) %>%
  ungroup()   %>% 
  summarise_each(funs(((function(x) {sum(x == 1)})(.))), 
                 count_bigram, count_trigram, 
                 count_four_gram, count_five_gram)



# # A tibble: 1 × 4
#    count_bigram count_trigram count_four_gram count_five_gram
#          <int>         <int>           <int>           <int>
#1            0             0               0               2