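I have a character vector like this:
x <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples." , "Mangoes and Apples and Honey")
I want a character vector that drops every element that occurs as a complete substring of any other element of the input vector. That is, the result would be:
"Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples and Honey"
The order doesn't matter. Here, the first two entries are removed because they are substrings of the third-to-last entry. The second-to-last entry is removed because it is also a substring of an earlier entry.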
Any help would be appreciated. This is part of some phrase detection I am doing on a corpus.
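Answer 0: (score: 2)
You can use grepl with word boundaries to look for an exact match of each element in every string. Elements with more than one match (one match being the element itself) are the ones to discard, i.e.
R solution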
v1 = colSums(sapply(x, function(i) grepl(paste0('\\b', i, '\\b'), x))) <= 1
names(v1)[v1]
#[1] "Apples are good for health" "I live in America" "I love Mangoes and Apples and Strawberries."
#[4] "Mangoes and Apples and Honey"
Python solution
import re
from itertools import compress

# x: a Python list holding the question's input strings
v2 = []
for i in x:
    # Use i as a regex pattern; keep it only if it matches exactly one string (itself)
    i1 = sum(re.search(i, a) is not None for a in x) == 1
    v2.append(i1)

list(compress(x, v2))
#['Apples are good for health', 'I live in America', 'I love Mangoes and Apples and Strawberries.', 'Mangoes and Apples and Honey']
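Both snippets in this answer hand each string to the regex engine as a pattern, so characters such as `.` act as metacharacters rather than literal text. If literal containment is what is wanted, a minimal Python sketch could use the `in` operator instead. The names `drop_contained` and `strings` are made up for illustration, and the sample strings are the ones from this thread without trailing periods.

```python
# Sample strings from the thread (the variant without trailing periods).
strings = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]


def drop_contained(items):
    """Keep only the strings that are not a literal substring of another item."""
    kept = []
    for i, candidate in enumerate(items):
        # Skip only the candidate's own position. Note that exact duplicates
        # contain each other, so under this rule both copies would be dropped.
        contained = any(
            candidate in other for j, other in enumerate(items) if i != j
        )
        if not contained:
            kept.append(candidate)
    return kept


result = drop_contained(strings)
print(result)
```

This compares every string with every other string, which is quadratic in the number of strings; that is likely fine for a short phrase list but may be slow on a large corpus.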
Answer 1: (score: 1)
You could do this...
vec <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" ,
"I live in America" , "I love Mangoes and Apples and Strawberries." ,
"Mangoes and Apples." , "Mangoes and Apples and Honey")
vec <- vec[order(nchar(vec))] #sort by string length
vec[!c(sapply(2:length(vec), #iterate from shortest to longest
function(i) any(grepl(vec[i-1], vec[i:length(vec)]))), #check whether shorter is included in any longer
FALSE)] #add value for final (longest) entry
[1] "I live in America" "Apples are good for health"
[3] "Mangoes and Apples and Honey" "I love Mangoes and Apples and Strawberries."
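For comparison, the same sort-by-length idea can be sketched in Python. This is an illustration only, not a port of the R code: it uses literal substring tests (`in`) rather than regex matching, the function name `drop_contained_sorted` is hypothetical, and the sample strings are the thread's strings without trailing periods.

```python
def drop_contained_sorted(items):
    """Sort by length, then test each string only against the strings after it."""
    ordered = sorted(items, key=len)  # shortest first; ties keep input order
    kept = []
    for i, candidate in enumerate(ordered):
        # Only strings at later positions (same length or longer) can contain it.
        if not any(candidate in longer for longer in ordered[i + 1:]):
            kept.append(candidate)
    return kept


phrases = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]

filtered = drop_contained_sorted(phrases)
print(filtered)
```

Sorting first means each string is checked only against candidates that are at least as long, which roughly halves the number of comparisons compared with testing every pair in both directions.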
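Answer 2: (score: 1)
We can also use combn to enumerate all pairwise string comparisons, then apply grepl to each pair to remove the strings that match inside another string.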
df <- as.data.frame(combn(s, 2));
rmv <- unique(unname(unlist(df[1, sapply(df, function(x) grepl(x[1], x[2]))])))
s[!(s %in% rmv)]
#[1] "Apples are good for health"
#[2] "I live in America"
#[3] "I love Mangoes and Apples and Strawberries"
#[4] "Mangoes and Apples and Honey"
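One caveat worth keeping in mind: combn(s, 2) produces each pair once, with the elements in input order, and the grepl call above only tests whether the first element of a pair occurs in the second. A short string that appears after the longer string containing it would therefore be missed. A hedged Python sketch of an order-independent variant, using `itertools.permutations` to test both directions and literal `in` tests instead of regex, might look like this (the function name `drop_contained_pairwise` and the reversed input are illustrative assumptions):

```python
from itertools import permutations


def drop_contained_pairwise(items):
    """Drop every string that occurs inside another one, whatever the input order."""
    # permutations(items, 2) yields every ordered pair (a, b) taken from two
    # different positions, so both "a in b" and "b in a" get tested.
    contained = {a for a, b in permutations(items, 2) if a in b}
    return [item for item in items if item not in contained]


# The thread's sample strings in reverse order, so that the shorter strings
# come after the longer strings that contain them.
reversed_input = [
    "Mangoes and Apples and Honey",
    "Mangoes and Apples",
    "I love Mangoes and Apples and Strawberries",
    "I live in America",
    "Apples are good for health",
    "I love Mangoes and Apples",
    "I love Mangoes",
]

survivors = drop_contained_pairwise(reversed_input)
print(survivors)
```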
Sample data
s <- c(
"I love Mangoes" ,
"I love Mangoes and Apples" ,
"Apples are good for health" ,
"I live in America" ,
"I love Mangoes and Apples and Strawberries" ,
"Mangoes and Apples" ,
"Mangoes and Apples and Honey")