R regex: using gsub to replace whole words that contain a specific word/substring

Time: 2017-12-24 06:14:56

Tags: r regex

I want to use gsub to remove every word that contains any of the substrings in a list.

words_to_eliminate = c("the", "of", "add", "is")
sentences = c("Other people are here", "This person is being offensive", "I'm addicted")
gsub(words_to_eliminate, "", sentences)

What I want:

#" people are here", " person being ", "I'm "

Thanks

3 answers:

Answer 0 (score: 2)

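You can use paste0 with collapse = "|" to build an alternation pattern, but to also grab the letters on either side of each substring you need to add something that matches them, such as \\w* (and \\s* for any surrounding whitespace).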

words_to_eliminate = c("the", "of", "add", "is")
sentences = c("Other people are here", "This person is being offensive", "I'm addicted")

paste0('\\s*\\w*', words_to_eliminate, '\\w*\\s*', collapse = '|')
#> [1] "\\s*\\w*the\\w*\\s*|\\s*\\w*of\\w*\\s*|\\s*\\w*add\\w*\\s*|\\s*\\w*is\\w*\\s*"

gsub(paste0('\\s*\\w*', words_to_eliminate, '\\w*\\s*', collapse = '|'), ' ', sentences)
#> [1] " people are here" " person being "   "I'm "

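However, that pattern is needlessly repetitive and can be shortened considerably with a group (either capturing (...) or non-capturing (?:...) works here), although building the pattern actually takes a bit more code: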

paste0('\\s*\\w*(', paste(words_to_eliminate, collapse = '|'), ')\\w*\\s*')
#> [1] "\\s*\\w*(the|of|add|is)\\w*\\s*"

gsub(paste0('\\s*\\w*(', paste(words_to_eliminate, collapse = '|'), ')\\w*\\s*'), ' ', sentences)
#> [1] " people are here" " person being "   "I'm "
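The grouped pattern above can be wrapped in a small reusable function. This is a minimal sketch, not part of the original answer: the function name remove_words_containing and its arguments are hypothetical, and perl = TRUE is an assumption made so that the non-capturing group (?:...) is handled by the PCRE engine.

```r
words_to_eliminate <- c("the", "of", "add", "is")
sentences <- c("Other people are here", "This person is being offensive", "I'm addicted")

# Hypothetical helper: the name and arguments are illustrative, not from the answer
remove_words_containing <- function(x, substrings, replacement = " ") {
  # Build \s*\w*(?:the|of|add|is)\w*\s* from the substrings
  pattern <- paste0("\\s*\\w*(?:", paste(substrings, collapse = "|"), ")\\w*\\s*")
  # perl = TRUE is an assumption; it selects PCRE for the non-capturing group
  gsub(pattern, replacement, x, perl = TRUE)
}

cleaned <- remove_words_containing(sentences, words_to_eliminate)
# cleaned is c(" people are here", " person being ", "I'm ")
```

Note that the substrings are pasted into the pattern unescaped, so this sketch assumes they contain no regex metacharacters. If the leading and trailing spaces are unwanted, trimws(cleaned) removes them.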

Answer 1 (score: 1)

If I understand correctly, you can try this, which removes any word that contains one of those words:

gsub("(?>\\w*|\\s*)-(?>(\\w*|\\s*))","", gsub(paste0(words_to_eliminate,collapse="|"),"-",sentences) , perl=T)

Output

   > gsub("(?>\\w*|\\s*)-(?>(\\w*|\\s*))","", gsub(paste0(words_to_eliminate,collapse="|"),"-",sentences) , perl=T)
[1] " people are here" " person being "    "I'm "   
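To see how the two nested calls cooperate, the inner substitution can be run on its own. The sketch below shows only that first step; the variable name marked is illustrative, and it assumes the input never contains a literal "-", since that character is used as the placeholder.

```r
words_to_eliminate <- c("the", "of", "add", "is")
sentences <- c("Other people are here", "This person is being offensive", "I'm addicted")

# Step 1: replace every occurrence of a listed substring with the placeholder "-"
marked <- gsub(paste0(words_to_eliminate, collapse = "|"), "-", sentences)
# marked is c("O-r people are here", "Th- person - being -fensive", "I'm -icted")
```

The outer gsub call then deletes each "-" together with the word characters attached to it, which is what removes the rest of the word.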

Answer 2 (score: 1)

For each word in words_to_eliminate, add \<[a-z]* to the beginning and [a-z]*\> to the end.
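A minimal sketch of that idea, under stated assumptions, might look like the following. The variable names pattern and result are illustrative, and ignore.case = TRUE is an assumption added here so that capitalized words such as "Other" and "This" are matched by [a-z]; it relies on R's default regex engine, which supports \< and \> as word-start and word-end anchors.

```r
words_to_eliminate <- c("the", "of", "add", "is")
sentences <- c("Other people are here", "This person is being offensive", "I'm addicted")

# Wrap each substring as \<[a-z]*WORD[a-z]*\> and join the alternatives with "|"
pattern <- paste0("\\<[a-z]*", words_to_eliminate, "[a-z]*\\>", collapse = "|")

# ignore.case = TRUE is an assumption so that capitalized words also match
result <- gsub(pattern, "", sentences, ignore.case = TRUE)
# result is c(" people are here", " person  being ", "I'm ")
```

Because only the words themselves are removed, the spaces around them stay in place, so the second sentence keeps two spaces between "person" and "being".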