stri_replace_all_regex does not accept patterns and replacements imported from a file

Time: 2016-05-18 17:17:59

Tags: r applescript stringr stringi

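I have an AppleScript that finds and replaces about a hundred terms using regular expressions, and I want to bring this find-and-replace into R. In Script Editor I saved the AppleScript as a text file and imported it into R with readLines(). The dput() of the imported result looks like punct.out, shown below. When I build my own data frame of patterns and replacements from raw vectors instead of from the import (see punct below), the find and replace on a test string (see test below) works fine. But when I run the same command with the imported data frame, it does not work: it returns NA.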

So somehow the imported text is not being interpreted as regular expressions or as a character vector... I can't figure it out.

#structure of my imported patterns and replacements
punct.out<-structure(list(replace = c(NA, NA, "good-bye[a-z]+|good-bye", 
"good bye[a-z]+|good bye", "good-", "ill at ease", "ill-", "-like", 
" well,", "- well,", ", well,", "as well", ".,", ".... well", 
"... well", ". Well,", ": well,", "well-", "well,", "well,", 
"well,", "Well,", "- okay,", ", okay,", "okay,", " okay,", ".... okay", 
"... okay", ". Okay,", ": okay,", "OK", "'okay,", "okay,", "Okay,", 
"Okay", ", too", "too /", "too,", "too.", "too?", "too:", "(No)(. )([0-9]+)", 
"( [A-Z])(.)( )", "www.", "ain't", "let's", "won't", "can't", 
"n't", "cannot", "'d", "'ll", "'m", "'ve", "'re", "!", "?", ";", 
"", ",", "--", "-", "-", "é", "è", "à", "ç", "&", "%", "per cent", 
"_", "Que.", "Ont.", "Nfld.", "Alta.", "Man.", "Sask.", "St.", 
"Ste.", "i.e.", "Mr.", "Ms.", "Mrs.", "Prof.", ".com", "a. m.", 
"p. m.", "a.m.", "p.m.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", 
"Jul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.", "gen.", "Dr.", 
"e. coli", "(.)([A-Z])(.)", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([0-9])(.)([0-9])", "()(S)", "([a-z]+)(')", "(')([a-z]+)", "bull ' s eye", 
"no man ' s land", "pandora ' s box", "....", "...", ".", ",", 
":", "", "", "", "", NA, NA), with = c("character(0)", "character(0)", 
"goodbye", "goodbye", "good x", "ill at xease", "ill x", " xlike", 
" xwell", " xwell", " xwell", "as xwell", " ", " xwell", " xwell", 
". xWell", ": xwell", "well x", "xwell", " xwell", "xwell", "xWell", 
" xokay", " xokay", " xokay", " xokay", " xokay", " xokay", ". xOkay", 
": xokay", "okay", "xokay", "xokay", "xOkay", "xOkay", " xtoo", 
"xtoo /", "xtoo", "xtoo.", "xtoo.", "xtoo", "#\\\\3", "\\\\1\\\\3", 
"www", "am not", "let us", "will not", "can not", " not", "can not", 
" would", " will", " am", " have", " are", ".", ".", "", "", 
"", " ", " ", " ", "e", "e", "a", "c", "and", "percent", "percent", 
" ", "Que", "Ont", "Nfld", "Alta", "Man", "Sask", "St", "Ste", 
"ie", "Mr", "Ms", "Mrs", "Prof", "com", "am", "pm", " am", " pm", 
"Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sept", "Oct", 
"Nov", "Dec", "gen", "Dr", "e coli", "\\\\1\\\\2 ", "\\\\1\\\\3", 
"\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1dot\\\\3", 
"\\\\1 \\\\2", "\\\\1 \\\\2", "\\\\1 \\\\2", "bull's eye", "no man's land", 
"pandora's box", "", "", " . ", " ,", "", " ", " ", " ", " ", 
"character(0)", "character(0)")), .Names = c("replace", "with"
), row.names = c(NA, -127L), class = "data.frame")

# Load stringi
library(stringi)
# Test strings
test <- c('Sept.', 'Mr.', 'Oct.', 'ill at ease', 'as well', 'Dr.', 'OK',
          'well,', '.com')
# Data frame of patterns and replacements
punct <- data.frame(
  replace = c('ill at ease', 'Sept.', 'Mr.', 'Oct.', 'as well', 'Dr.', 'OK',
              'well,', '.com'),
  with = c('ill at xease', 'Sept', 'Mr', 'Oct', 'as xwell', 'Dr', 'okay',
           'xwell', 'com'),
  stringsAsFactors = FALSE)
# This works
stri_replace_all_regex(test, punct$replace, punct$with, vectorize_all = FALSE)
# But this doesn't
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
                       vectorize_all = FALSE)

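Second question: I solved the problem above based on the comments below. However, some of the regular expressions still have specific problems. In particular, I don't know how to escape the backslashes so that the replacement prints the first and second captured groups of the match, i.e. \1, \2, and so on.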

# Define data
punct.out <- structure(list(
  replace = c("(\\.)([A-Z])(\\.)", "([A-Z])(\\.)([A-Z])",
              "([0-9])(\\.)([0-9])", "([a-z]+)(')", "(')([a-z]+)"),
  with = c("\\\\1\\\\2 ", "\\\\1\\\\3", "\\\\1dot\\\\3", "\\\\1 \\\\2",
           "\\\\1 \\\\2")),
  .Names = c("replace", "with"),
  row.names = c(104L, 105L, 110L, 112L, 113L), class = "data.frame")
# Test strings that the regexes above are supposed to match
test <- c('.B.', 'B.B', '1.1', 'premier\'s')
# This sort of works, but I clearly haven't figured out how to properly escape
# the backslashes to capture the references
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
                       vectorize_all = FALSE)
# Based on the help for stri_replace I also tried using $ to capture the
# references.
punct.out$with <- gsub('\\\\\\\\', '$', punct.out$with)
# And it did work.
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
                       vectorize_all = FALSE)
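
For reference, here is a minimal sketch of the `$` form on a single pattern. It assumes stringi is installed; `pattern`, `imported_with` and `converted` are names made up for this illustration. It only shows that `$1`-style references are accepted in the replacement and that the doubled backslashes from the import can be swapped for `$` with a literal (non-regex) substitution.

```r
library(stringi)

test <- c(".B.", "B.B", "1.1", "premier's")
pattern <- "([0-9])(\\.)([0-9])"   # hypothetical single rule: digit, literal dot, digit

# $1 and $3 refer to the first and third capture groups
direct <- stri_replace_all_regex(test, pattern, "$1dot$3")
print(direct)

# Replacement as it arrives from the imported file: two backslashes before each digit
imported_with <- "\\\\1dot\\\\3"

# fixed = TRUE matches the two-backslash string literally, so no regex escaping is needed
converted <- gsub("\\\\", "$", imported_with, fixed = TRUE)
print(converted)

from_import <- stri_replace_all_regex(test, pattern, converted)
print(from_import)
```

Using `fixed = TRUE` avoids the eight-backslash pattern, which is easy to miscount.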

1 Answer:

Answer 0 (score: 1)

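punct.out contains missing values (NA). That is why you get NA in the output. You should remove them first, for example with na.omit. In addition, when you do regex matching, some characters (e.g. .) have a special meaning and should be escaped, i.e. preceded by a backslash, if you want them matched literally. Also note that the first column contains some empty strings; those rows should be removed as well.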
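
The three steps can be sketched roughly as follows. This assumes stringi is installed and uses a small made-up table, `rules`, as a stand-in for the imported punct.out. Escaping every dot is only appropriate when each dot in the pattern column is meant literally, which is an assumption here and would not hold for patterns that use `.` as a wildcard.

```r
library(stringi)

# Hypothetical stand-in for the imported table, with an NA row and an empty pattern
rules <- data.frame(
  replace = c(NA, "Sept.", "", "Mr.", "as well"),
  with    = c("character(0)", "Sept", " ", "Mr", "as xwell"),
  stringsAsFactors = FALSE
)

# 1. Drop rows with missing values
rules <- na.omit(rules)

# 2. Drop rows whose pattern is an empty string
rules <- rules[rules$replace != "", ]

# 3. Escape literal dots so "." no longer matches any character
rules$replace <- stri_replace_all_fixed(rules$replace, ".", "\\.")

test <- c("Sept.", "Mr.", "as well", "Septa")
cleaned <- stri_replace_all_regex(test, rules$replace, rules$with,
                                  vectorize_all = FALSE)
print(cleaned)
```

With the dot escaped, "Septa" is left alone; an unescaped "Sept." would also match it.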