Filter out substrings in a vector of strings

Time: 2018-04-18 12:55:42

Tags: python r regex

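I have a vector of strings like this:

"I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples." , "Mangoes and Apples and Honey"

I want a vector of strings that filters out every element that is a complete substring match of any other element of the input vector. That is, the result would be:

"Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples and Honey"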

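The order doesn't matter. Here, the first two entries are removed because they are substrings of the third-to-last entry. The second-to-last entry is removed because it is also a substring of other entries.

Any help would be much appreciated. This is part of a phrase detection task I am doing on a corpus.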

3 answers:

Answer 0 (score: 2)

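You can use grepl with word boundaries to find, for each element, the strings in which it matches exactly. Elements with more than one match (one = the element itself) are the ones to discard, i.e.

R solution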

v1 = colSums(sapply(x, function(i) grepl(paste0('\\b', i, '\\b'), x))) <= 1
names(v1)[v1]
#[1] "Apples are good for health"  "I live in America" "I love Mangoes and Apples and Strawberries."
#[4] "Mangoes and Apples and Honey" 

Python solution

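import re
from itertools import compress

# Input vector from the question
x = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries.",
     "Mangoes and Apples.", "Mangoes and Apples and Honey"]

v2 = []
for i in x:
    # i is used as a regex pattern; keep it only if it matches exactly one element (itself)
    i1 = sum(re.search(i, a) is not None for a in x) == 1
    v2.append(i1)

list(compress(x, v2))
#['Apples are good for health', 'I live in America', 'I love Mangoes and Apples and Strawberries.', 'Mangoes and Apples and Honey']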
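One thing worth knowing about the Python version: `re.search(i, a)` treats each string as a regular expression, so the trailing `.` matches any character. That is why "I love Mangoes." is found inside "I love Mangoes and Apples."; a literal `in` test would not find it. Below is a minimal sketch of a literal variant, assuming trailing periods should be ignored when comparing. The helper name `drop_contained` and its `strip_chars` parameter are made up for illustration.

```python
def drop_contained(strings, strip_chars="."):
    """Keep only the strings that are not contained in any other string.

    Hypothetical helper: trailing characters in `strip_chars` are ignored
    when comparing (an assumption, not something the question states).
    """
    keys = [s.rstrip(strip_chars) for s in strings]
    kept = []
    for i, key in enumerate(keys):
        # Literal substring test against every *other* element
        contained = any(i != j and key in other for j, other in enumerate(keys))
        if not contained:
            kept.append(strings[i])
    return kept


x = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries.",
     "Mangoes and Apples.", "Mangoes and Apples and Honey"]

print(drop_contained(x))
```

Note that exact duplicates contain each other, so this sketch drops both copies; deduplicate first if that matters.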

Answer 1 (score: 1)

You could do this:

vec <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , 
         "I live in America" , "I love Mangoes and Apples and Strawberries." , 
         "Mangoes and Apples." , "Mangoes and Apples and Honey")

vec <- vec[order(nchar(vec))] #sort by string length

vec[!c(sapply(2:length(vec), #iterate from shortest to longest
              function(i) any(grepl(vec[i-1], vec[i:length(vec)]))), #check whether shorter is included in any longer
       FALSE)] #add value for final (longest) entry

[1] "I live in America"                           "Apples are good for health"                 
[3] "Mangoes and Apples and Honey"                "I love Mangoes and Apples and Strawberries."
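Since the question is also tagged python, here is a rough Python sketch of the same sort-by-length idea, not a drop-in port. It assumes `re.search` is an acceptable stand-in for `grepl` (both treat the shorter string as a regular expression); the names `ordered` and `kept` are illustrative only.

```python
import re

vec = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
       "I live in America", "I love Mangoes and Apples and Strawberries.",
       "Mangoes and Apples.", "Mangoes and Apples and Honey"]

# sorted() is stable, so ties in length keep their original order
ordered = sorted(vec, key=len)

# Keep a string only if it does not match inside any string after it
# (i.e. any string of equal or greater length)
kept = [
    s for i, s in enumerate(ordered)
    if not any(re.search(s, longer) for longer in ordered[i + 1:])
]

print(kept)
```

Sorting first means each string only has to be checked against the ones after it, which roughly halves the number of comparisons compared with checking every pair.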

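Answer 2 (score: 1)

We can also use combn to enumerate all pairwise string comparisons, then use grepl on all pairwise combinations to remove the strings that match inside other strings.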

df <- as.data.frame(combn(s, 2));
rmv <- unique(unname(unlist(df[1, sapply(df, function(x) grepl(x[1], x[2]))])))
s[!(s %in% rmv)]
#[1] "Apples are good for health"
#[2] "I live in America"
#[3] "I love Mangoes and Apples and Strawberries"
#[4] "Mangoes and Apples and Honey"
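A caveat, as far as I can tell: `combn(s, 2)` yields each pair only once, in input order, so the code above only tests whether the earlier string occurs in the later one. A minimal Python sketch of the pairwise idea that checks both directions, using `itertools.permutations` and a literal `in` test (my substitution, not part of the answer), could look like this. The list `s` repeats the sample data without periods.

```python
from itertools import permutations

s = ["I love Mangoes", "I love Mangoes and Apples", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries",
     "Mangoes and Apples", "Mangoes and Apples and Honey"]

# Every ordered pair (a, b) taken from different positions; collect each a found inside some b
contained = {a for a, b in permutations(s, 2) if a in b}

# Keep the original order of the strings that are not contained anywhere else
result = [t for t in s if t not in contained]

print(result)
```

Exact duplicates are treated as containing each other here, so both copies would be removed.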

Sample data

s <- c(
    "I love Mangoes" ,
    "I love Mangoes and Apples" ,
    "Apples are good for health" ,
    "I live in America" ,
    "I love Mangoes and Apples and Strawberries" ,
    "Mangoes and Apples" ,
    "Mangoes and Apples and Honey")