Suppose I have a character vector:
vec <- c("man lives to work", "man works to live")
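Starting from the beginning of each string, I want to find the shortest unique substrings (whole words) in this vector.
In other words, I am not looking for the shortest substrings overall; I want to cut each string right after the word at which it becomes unique, in this case after "lives" and "works", respectively.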
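So the result in this case should be:
[1] "man lives" "man works"
The strings should be cut after "lives" / "works", because that is the earliest point at which they become unique (in this context).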
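Including "to" would be redundant, because the strings are already unique.
Including only "man" is not enough, because c("man", "man") is not unique.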
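(I want to use this to automatically generate valid R names; another function will do the rest.)
How can I do this?
I assume a package that does this already exists, but I could not find it.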
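To illustrate the naming step mentioned above, here is a minimal sketch that assumes the trimmed strings are passed to base R's make.names. That choice is an assumption, as are the variable names trimmed and valid_names.

```r
# Assumed input: the trimmed strings the question asks for.
trimmed <- c("man lives", "man works")

# make.names() replaces characters that are invalid in R names (here the
# space) with a dot; unique = TRUE also disambiguates any repeated names.
valid_names <- make.names(trimmed, unique = TRUE)
print(valid_names)
```

This is why the trimming step matters: the shorter and more distinctive the input strings, the more readable the resulting names.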
Answer 0 (score: 2)
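As a general strategy, I would a) check whether the first word is unique, b) if not, check whether the first two words are unique, and c) continue until a unique solution is found for each string.
You can implement this with a while loop or with recursion. Here is an updated example (updated to preserve the original order):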
library(stringi) ## makes string processing easier
vec <- c("man lives to work", "man works to live")
(word.mat <- stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
##      [,1]  [,2]    [,3] [,4]
## [1,] "man" "lives" "to" "work"
## [2,] "man" "works" "to" "live"
## function to extract unique words
unique_words <- function(x,           # matrix of words
                         n = nrow(x), # number of original strings
                         nc = 1       # number of columns (words) to use
                         ) {
  ## join the first nc words
  s <- stri_trim(apply(x[, 1:nc, drop = FALSE], 1, stri_join, collapse = " "))
  ## find non-duplicated word combinations, and store in column 1
  nodups <- !s %in% s[stri_duplicated(s)]
  x[nodups, 1] <- s[nodups]
  ## remove extra words from the matrix
  x[nodups, -1] <- ""
  if (any(x[, 2] != "")) {
    ## if some strings are not unique, do it again, increasing nc by one
    x <- unique_words(x = x, n = n, nc = nc + 1)
  } else {
    ## otherwise, grab the unique sub-phrases from column 1
    x <- x[, 1]
  }
  ## return the result
  x
}
## test it out
unique_words(word.mat)
## [1] "man lives" "man works"
## test it out with a more complicated example:
vec <- c("foo", "man lives to eat", "man eats to live",
         "woman lives to work", "woman works to live",
         "we like apples", "we like peaches",
         "they like plums", "they love peas", "bar")
unique_words(stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
##  [1] "foo"             "man lives"       "man eats"        "woman lives"
##  [5] "woman works"     "we like apples"  "we like peaches" "they like"
##  [9] "they love"       "bar"
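The answer mentions a while loop as an alternative to recursion but only shows the recursive form. The following is a minimal base R sketch of the loop variant, not the answer author's code. The function name shortest_unique_prefix and its helper build are hypothetical, and it assumes words are separated by whitespace (it uses strsplit rather than stringi word boundaries).

```r
# Hypothetical helper: grow each string's word prefix until it is unique.
shortest_unique_prefix <- function(strings) {
  words   <- strsplit(trimws(strings), "[[:space:]]+")
  n_words <- lengths(words)
  k       <- rep(1L, length(strings))  # number of words currently kept

  # Build the current prefix of every string from its first k words.
  build <- function() {
    mapply(function(w, n) paste(head(w, n), collapse = " "),
           words, k, USE.NAMES = FALSE)
  }

  out <- build()
  # A prefix must grow if it is shared and its string has words left.
  grow <- (duplicated(out) | duplicated(out, fromLast = TRUE)) & k < n_words
  while (any(grow)) {
    k[grow] <- k[grow] + 1L
    out <- build()
    grow <- (duplicated(out) | duplicated(out, fromLast = TRUE)) & k < n_words
  }
  out
}

res_simple <- shortest_unique_prefix(c("man lives to work", "man works to live"))
print(res_simple)
```

Because a prefix stops growing once its string runs out of words, the loop always terminates; fully identical input strings are returned unchanged and remain duplicated.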
Answer 1 (score: 1)
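library(dplyr)     ## pipe, mutate, group_by, summarise, across
library(tidytext)  ## unnest_tokens
## assumption: df holds the question's strings in a character column named words
df <- data.frame(words = c("man lives to work", "man works to live"),
                 stringsAsFactors = FALSE)
## split into single words, take each word's first 2 to 5 characters,
## then count how many of those prefixes occur exactly once
df %>% unnest_tokens(word, words) %>%
  mutate(bigram = substr(word, 1, 2),
         trigram = ifelse(nchar(word) >= 3, substr(word, 1, 3), NA),
         four_gram = ifelse(nchar(word) >= 4, substr(word, 1, 4), NA),
         five_gram = ifelse(nchar(word) >= 5, substr(word, 1, 5), NA)) %>%
  group_by(bigram) %>%
  mutate(count_bigram = n()) %>%
  ungroup() %>%
  group_by(trigram) %>%
  mutate(count_trigram = n()) %>%
  ungroup() %>%
  group_by(four_gram) %>%
  mutate(count_four_gram = n()) %>%
  ungroup() %>%
  group_by(five_gram) %>%
  mutate(count_five_gram = n()) %>%
  ungroup() %>%
  ## across() replaces the deprecated summarise_each(funs(...))
  summarise(across(c(count_bigram, count_trigram,
                     count_four_gram, count_five_gram),
                   ~ sum(.x == 1)))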
# # A tibble: 1 × 4
#   count_bigram count_trigram count_four_gram count_five_gram
#          <int>         <int>           <int>           <int>
# 1            0             0               0               2
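The pipeline above comes without explanation. It appears to count, for each prefix length from 2 to 5 characters, how many word prefixes occur exactly once. The base R sketch below shows the same idea under that reading. The function name count_unique_prefixes is hypothetical, the input is assumed to be the question's vector, and unlike the pipeline it ignores words that are too short rather than grouping them as NA.

```r
vec <- c("man lives to work", "man works to live")
words <- unlist(strsplit(vec, " ", fixed = TRUE))

# Hypothetical helper: number of k-character prefixes that occur exactly once.
count_unique_prefixes <- function(words, k) {
  # Words shorter than k characters get NA and are dropped by table().
  prefixes <- ifelse(nchar(words) >= k, substr(words, 1, k), NA)
  counts <- table(prefixes)
  sum(counts == 1)
}

prefix_counts <- sapply(2:5, count_unique_prefixes, words = words)
names(prefix_counts) <- paste0("len_", 2:5)
print(prefix_counts)
```

Only at five characters do "lives" and "works" separate from "live" and "work". Note that this counts character prefixes of single words, which is a different notion of uniqueness from the whole-word prefixes the question asks about.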