Question

Suppose I have a character vector such as:

vec <- c("man lives to work", "man works to live")

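Starting from the beginning of each string, I want to find the shortest unique substring (made of whole words) in this vector. In other words, I am not looking for the shortest substring overall; I want each string trimmed at the point where it becomes unique, in this case after lives and works, respectively.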

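So the result in this case should be:

[1] "man lives" "man works"

The strings should be trimmed after lives / works, because that is the earliest point at which they become unique (in this context). Including to would be redundant, because the strings are already unique by then. Including only man is not enough, because man is not unique.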
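The reasoning above can be checked directly in base R. This is only a small illustration, not a solution: first_k() is a hypothetical helper introduced here, and splitting on single spaces is an assumption that happens to fit this example.

```r
vec <- c("man lives to work", "man works to live")
words <- strsplit(vec, " ", fixed = TRUE)

## first_k() is a hypothetical helper: the first k words of each string
first_k <- function(k) {
    vapply(words, function(w) paste(head(w, k), collapse = " "), character(1))
}

first_k(1)
## [1] "man" "man"
anyDuplicated(first_k(1)) > 0   # TRUE: one word is not enough

first_k(2)
## [1] "man lives" "man works"
anyDuplicated(first_k(2)) > 0   # FALSE: two words are enough
```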

(I want to use this to automatically generate valid R names; make.names() will do the rest.)
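For context, here is roughly what that last step might look like, assuming the clean-up is done with base R's make.names(). The input vector candidates is made up for illustration and stands in for the trimmed strings.

```r
## candidates stands in for the trimmed, already-unique strings
candidates <- c("man lives", "man works")

## make.names() replaces invalid characters (here the space) with "."
make.names(candidates)
## [1] "man.lives" "man.works"

## Trimming to the first word alone would need unique = TRUE to disambiguate
make.names(c("man", "man"), unique = TRUE)
## [1] "man"   "man.1"
```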

How can I do this?

I assume there is already a package that does this, but I could not find one.

Answer 1

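As a general strategy, I would a) check whether the first word is unique, b) if not, check whether the first two words are unique, and c) continue until a unique solution has been found for each string.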

You can implement this with a while loop or with recursion. Here is an updated example (updated to preserve the original order):

library(stringi) ## makes string processing easier

vec <- c("man lives to work", "man works to live")

(word.mat <- stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
##      [,1]  [,2]    [,3] [,4]  
## [1,] "man" "lives" "to" "work"
## [2,] "man" "works" "to" "live"

## function to extract unique words
unique_words <- function(x, # matrix of words
                         n = nrow(x), # number of original strings
                         nc=1 # number of columns (words) to use
                         ) {
    ## join the first nc words
    s <- stri_trim(apply(x[, 1:nc, drop = FALSE], 1, stri_join, collapse = " "))
    ## find non-duplicated word combinations, and store in column 1
    nodups <- !s %in% s[stri_duplicated(s)]
    x[nodups, 1] <- s[nodups]
    ## remove extra words from the matrix
    x[nodups, -1] <- ""
    ## if some strings are not unique, do it again, increasing nc by one
    if(any(x[, 2] != "")) {
        x <- unique_words(x = x, n = n, nc = nc + 1)
    ## otherwise, grab the unique sub-phrases from column 1    
    } else {
        x <- x[, 1]
    }
    ## return the result
    x
}    
## test it out
unique_words(word.mat)
## [1] "man lives" "man works"

## test it out with a more complicated example:
vec <- c("foo", "man lives to eat", "man eats to live",
         "woman lives to work", "woman works to live",
         "we like apples", "we like peaches",
         "they like plums", "they love peas", "bar")
unique_words(stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
## [1] "foo"             "man lives"       "man eats"        "woman lives"    
## [5] "woman works"     "we like apples"  "we like peaches" "they like"      
## [9] "they love"       "bar"
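
The answer mentions that a while loop would work as well as recursion. A minimal base-R sketch of that variant might look like the following. The function name shortest_unique_prefix() is hypothetical, and unlike the stringi version it splits on whitespace only, so punctuation stays attached to words. Strings that never become unique (exact duplicates) are returned whole, which is a design choice of this sketch rather than something the answer specifies.

```r
## Iterative sketch (hypothetical helper, base R only)
shortest_unique_prefix <- function(vec) {
    ## split each string into whitespace-separated words
    words <- strsplit(trimws(vec), "[[:space:]]+")
    n_words <- lengths(words)

    ## first k words of every string (the whole string if it is shorter)
    prefix <- function(k) {
        vapply(words, function(w) paste(head(w, k), collapse = " "),
               character(1))
    }

    result <- rep(NA_character_, length(vec))
    k <- 1
    ## grow the prefix by one word until every string is resolved
    while (anyNA(result) && k <= max(n_words)) {
        p <- prefix(k)
        is_dup <- duplicated(p) | duplicated(p, fromLast = TRUE)
        newly_unique <- is.na(result) & !is_dup
        result[newly_unique] <- p[newly_unique]
        k <- k + 1
    }

    ## assumption: strings that never become unique are kept in full
    still_open <- is.na(result)
    result[still_open] <- vapply(words[still_open], paste, character(1),
                                 collapse = " ")
    result
}

shortest_unique_prefix(c("man lives to work", "man works to live"))
## [1] "man lives" "man works"
```

Because results are written back by position, the output order matches the input order without any extra bookkeeping.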

Answer 2

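library(dplyr)    ## %>%, mutate, group_by, ...
library(tidytext) ## unnest_tokens

## Assumed setup (not shown in the original answer): a one-column tibble
## named df whose column `words` holds the example strings
df <- tibble(words = c("man lives to work", "man works to live"))

df %>%  unnest_tokens(word ,words) %>%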
  mutate(bigram = substr(word,1,2), 
         trigram = ifelse (nchar(word) >= 3,substr(word,1,3),NA) ,
         four_gram  = ifelse (nchar(word) >= 4, substr(word,1,4), NA), 
         five_gram  = ifelse (nchar(word) >= 5, substr(word,1,5), NA)) %>%
  group_by(bigram) %>%
  mutate(count_bigram = n()) %>%
  ungroup() %>%
  group_by(trigram) %>%
  mutate(count_trigram = n()) %>%
  ungroup() %>%
  group_by(four_gram) %>%
  mutate(count_four_gram = n()) %>%
  ungroup() %>%
  group_by(five_gram) %>%
  mutate(count_five_gram = n()) %>%
  ungroup()   %>% 
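  ## count the n-grams that occur exactly once
  ## (across() replaces the deprecated summarise_each()/funs(); needs dplyr >= 1.0.0)
  summarise(across(c(count_bigram, count_trigram,
                     count_four_gram, count_five_gram),
                   ~ sum(.x == 1)))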



## # A tibble: 1 × 4
##   count_bigram count_trigram count_four_gram count_five_gram
##          <int>         <int>           <int>           <int>
## 1            0             0               0               2

Shortest unique substrings of a vector
