Find text containing punctuation in a file and replace it

Asked: 2014-05-27 05:07:05

Tags: regex r replace grep

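Friends, I have already asked a related question here. The problem there is that txt (the keywords) is not detected when it contains punctuation. I tried to make the answer generic but failed.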

Basically, I have txt (keywords), some with punctuation and some without, that I need to search for in the file toSearch.

For example, these are the contents of my file toSearch:

 [1]'Nokia. Okay. R: Samsung R: Samsung M: And you have? R: I have Micromax'
 [2]'M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay.'
 [3]'M: What is your age ? R: 32 years R: My name is "Nitish". I have Interior designing business.'
 [4]'R: 3rd, Not extra spicy. R: 4th, Fresh. R: 5th, Variety. R: 6th, Hygienic environment'
 [5]'How you feel? How it should be? We will move forward, if there we have to make an ideal'
 [6]'What is the strength of your organisation? How many people a re working.'
 [7]'R: Read newspaper R:Had breakfast with family.'

The txt (keywords) are shown below. I use #@ to separate the keywords because I cannot use , (comma).

 txt<-"R: Samsung R: Samsung M:#@I have (Mahindra Scorpio and Mahindra's)#@R: 32 years R: My name is \"Nitish\"#@R: 4th, Fresh. R: 5th, Variety#@How you feel? How it should be?"

My expected output finds each occurrence of a keyword and replaces the spaces inside it with underscores (_):

 [1]'Nokia. Okay. R:_Samsung_R:_Samsung_M: And you have? R: I have Micromax'
 [2]'M: Okay, you have taken car. R: I_have_(Mahindra_Scorpio_and_Mahindra's) this Duro DZ.M: Okay.'
 [3]'M: What is your age ? R:_32_years_R:_My_name_is_"Nitish". I have Interior designing business.'
 [4]'R: 3rd, Not extra spicy. R:_4th,_Fresh._R:_5th,_Variety. R: 6th, Hygienic environment'
 [5]'How_you_feel?_How_it_should_be? We will move forward, if there we have to make an ideal'
 [6]'What is the strength of your organisation? How many people a re working.'
 [7]'R: Read newspaper R:Had breakfast with family.'

In case that is unclear, it is a simple find and replace text (FART) function. Only the spaces are replaced by _.

I tried using this regular expression:

for(i in 1:length(txt))
{
    #finding the first word of the keyword 
    start <- head(strsplit(txt, split=" ")[[i]], 1)  
    n <- stri_stats_latex(txt[i])[4] 

    #all possible occurrences for the keywords in the text
    o<-unlist(regmatches(toSearch,gregexpr(paste0(start,"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,",n-1,"}"),toSearch,ignore.case=TRUE)))  

    #exact match with the result
    p<-which(!is.na(pmatch(txt,o)))  

    #replace the keywords in the text file.
    text<-as.character(replace_all(text,txt[p],str_replace_all(txt[p]))) 
}

2 answers:

Answer 0 (score: 2)

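When working with regular expressions you have to be very careful with punctuation, because characters such as ( ) ? and . have special meaning in a pattern. If you want exact matches, it is better not to use a regular expression at all and to set fixed=T in grep/gsub. You can then do the find and replace with Reduce:
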
#input data
target<-c("Nokia. Okay. R: Samsung R: Samsung M: And you have? R: I have Micromax", 
"M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay.", 
"M: What is your age ? R: 32 years R: My name is \"Nitish\". I have Interior designing business.", 
"R: 3rd, Not extra spicy. R: 4th, Fresh. R: 5th, Variety. R: 6th, Hygienic environment", 
"How you feel? How it should be? We will move forward, if there we have to make an ideal", 
"What is the strength of your organisation? How many people a re working.", 
"R: Read newspaper R:Had breakfast with family.")

kw<-c("R: Samsung R: Samsung M:", "I have (Mahindra Scorpio and Mahindra's)", 
"R: 32 years R: My name is \"Nitish\"", "R: 4th, Fresh. R: 5th, Variety", 
"How you feel? How it should be?")

Here we use Reduce to replace each keyword in the target text in turn:

Reduce(function (t,kw) gsub(kw, gsub(" ","_",kw), t, fixed=T), 
    kw, init=target, accumulate=F)

# [1] "Nokia. Okay. R:_Samsung_R:_Samsung_M: And you have? R: I have Micromax"                         
# [2] "M: Okay, you have taken car. R: I_have_(Mahindra_Scorpio_and_Mahindra's) this Duro DZ.M: Okay." 
# [3] "M: What is your age ? R:_32_years_R:_My_name_is_\"Nitish\". I have Interior designing business."
# [4] "R: 3rd, Not extra spicy. R:_4th,_Fresh._R:_5th,_Variety. R: 6th, Hygienic environment"          
# [5] "How_you_feel?_How_it_should_be? We will move forward, if there we have to make an ideal"        
# [6] "What is the strength of your organisation? How many people a re working."                       
# [7] "R: Read newspaper R:Had breakfast with family." 
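
To see why fixed=T matters, the following minimal sketch runs one keyword from the question through gsub twice, once as a regular expression and once as a fixed string. The variable names (kw_paren, line2, as_regex, as_fixed) are made up for this illustration.

```r
# One keyword and one line taken from the question's data.
kw_paren <- "I have (Mahindra Scorpio and Mahindra's)"
line2 <- "M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay."

# The replacement is the keyword with its spaces turned into underscores.
repl <- gsub(" ", "_", kw_paren, fixed = TRUE)

# As a regex, "(" and ")" delimit a group instead of matching literal
# parentheses, so the pattern does not occur in the line and nothing changes.
as_regex <- gsub(kw_paren, repl, line2)

# With fixed = TRUE every character of the keyword is matched literally.
as_fixed <- gsub(kw_paren, repl, line2, fixed = TRUE)

print(identical(as_regex, line2))  # TRUE: the regex version left the line untouched
print(as_fixed)
```

The fixed version produces the same string as element [2] of the output above, while the regex version returns the line unchanged.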

I hope this helps with your FART.

Answer 1 (score: 0)

A simplified example that should carry over to the larger problem.

toSearch <- c("this is some text","something else to search")
txt <- c("is some#@else to")
txt <- strsplit(txt,"#@")[[1]]
txtundsc <- gsub("\\s+","_",txt)

# fixed = TRUE matches each keyword literally, so punctuation such as ( ) ? is safe
for(i in seq_along(txt)) { toSearch <- gsub(txt[i],txtundsc[i],toSearch,fixed=TRUE) }
toSearch
# [1] "this is_some text"        "something else_to search"
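
The same idea can be wrapped in a small function that accepts the "#@"-delimited keyword string from the question. This is only a sketch: the function name underscore_keywords and its sep argument are hypothetical, and processing longer keywords first is an added assumption (so that a short keyword cannot rewrite part of a longer one before the longer one is tried), not something either answer does.

```r
# Hypothetical helper (not from the original answers): replaces the spaces
# inside every literal keyword occurrence with underscores.
underscore_keywords <- function(text, keywords, sep = "#@") {
  # Split the delimited keyword string; fixed = TRUE keeps the separator literal.
  kws <- strsplit(keywords, sep, fixed = TRUE)[[1]]
  # Drop empty pieces, e.g. from a leading separator.
  kws <- kws[nzchar(kws)]
  # Assumption: handle longer keywords first so shorter ones cannot break them up.
  kws <- kws[order(nchar(kws), decreasing = TRUE)]
  for (kw in kws) {
    text <- gsub(kw, gsub(" ", "_", kw, fixed = TRUE), text, fixed = TRUE)
  }
  text
}

# Two lines and two keywords taken from the question's data.
lines_in <- c(
  "M: What is your age ? R: 32 years R: My name is \"Nitish\". I have Interior designing business.",
  "How you feel? How it should be? We will move forward, if there we have to make an ideal"
)
keys <- "R: 32 years R: My name is \"Nitish\"#@How you feel? How it should be?"

result <- underscore_keywords(lines_in, keys)
print(result)
```

Because both the split and the substitutions use fixed = TRUE, the quotes, colons, and question marks in the keywords need no escaping.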