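Find and extract words containing a punctuation expression in R

Time: 2015-08-25 15:24:00

Tags: regex r grep text-analysis

I am trying to extract words that contain a punctuation expression from a large text (about 17,000 documents). For example: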

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The 
 aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A 
 cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This 
 prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"

I want to extract words such as:

 [1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
 [2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
 [3] c(PATIENTS & METHODS,
 [4] c(c(Introduction

But not words such as "cross-sectional", "2013", "2)" or "(inability". This is only the first step; my goal is to end up with this:

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this 
 study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
 surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
 [95] Introduction Silicosis is a fibrotic"

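To extract these words without also grabbing every other word that contains punctuation (such as "surgeries.n"), I noticed that they always start with or contain the expression "c(". But I ran into some trouble with the regular expression: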

grep("c(", test)
    Error en grep("c(", test) : 
    invalid regular expression 'c(', reason 'Missing ')''

I also tried:

grep("c\\(", test, value = T)

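But this returns the whole text file. I also tried str_match from the dap package, but I can't seem to get the pattern (regular expression) right. Any suggestions?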

2 answers:

Answer 0: (score: 0)

Try this:

text <- "...urine bag tubing and the vent jutting above the summit also strapped with the white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"

library(stringr)
# Split the text into space-separated words (str_split returns a list)
words <- str_split(text, " ")
# Keep only the words that contain a literal "c("
words[[1]][grepl("c\\(", words[[1]])]
## [1] "\n\nc(A<sc>IMS"    "c(M<sc>ATERIALS"   "\n\nc(PATIENTS"    "c(c(Introduction,"
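
As a possible alternative in base R, the sketch below pulls out each heading as a whole instead of word by word. It assumes (this is not stated in the question) that every heading starts with one or more "c(" and ends just before the first comma. The shortened sample string and the name headings are made up for illustration.

```r
# Shortened sample; assumption: each heading runs from "c(" to the next comma.
text <- "tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim ... surgeries.n), \n\nc(PATIENTS & METHODS, This ...[95] c(c(Introduction, Silicosis"

# "(c\\()+" : one or more literal "c("
# "[^,]+"   : everything up to (not including) the next comma
# gregexpr() locates every match; regmatches() returns the matched substrings.
headings <- regmatches(text, gregexpr("(c\\()+[^,]+", text, perl = TRUE))[[1]]
print(headings)
```

This also hints at why grep("c\\(", test, value = T) returned the whole text: grep() returns entire vector elements that contain a match, so a text stored as one long string comes back in full. regmatches() returns only the matched part.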

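Answer 1: (score: 0)

If I understand your question (I'm not sure whether your second text is the expected output or just an intermediate step), I would use gsub like this: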

gsub("(c\\(|<\\/?sc>)","",text)

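The regular expression (the first argument) matches c(, <sc> and </sc> and replaces them with nothing, which cleans the text as expected (again, if I understood your expectation correctly).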
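
A minimal sketch of that cleanup on a shortened sample string (the string and the variable names x and cleaned are illustrative). The pattern leaves the forward slash unescaped, which is equivalent here because "/" is not a regex metacharacter in R. The last step, turning "&" into "AND", is an assumption read off the desired output in the question, not part of the gsub call above.

```r
# Shortened sample string (illustrative only).
x <- "c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim ... c(PATIENTS & METHODS, This"

# Remove every literal "c(" as well as the <sc> and </sc> tags.
cleaned <- gsub("(c\\(|</?sc>)", "", x)
print(cleaned)

# Assumed extra step: the desired output shows "PATIENTS AND METHODS",
# so replace the ampersand as plain text (fixed = TRUE, no regex involved).
cleaned <- gsub("&", "AND", cleaned, fixed = TRUE)
print(cleaned)
```

Note that the pattern removes "c(" wherever it occurs, so a word that legitimately ends in c followed by an opening parenthesis would be altered too.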

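More on the regular expression involved:

  • (|) is the structure for an OR condition.
  • c\\( matches the literal text c( anywhere in the text.
  • <\\/?sc> matches <sc> or </sc>; the ? after the / means the slash may occur 0 or 1 times, so it is optional.
  • The \\ is there so that, after the R interpreter has consumed the first backslash, one backslash remains to tell the regex engine that we want to match a literal ( and a literal /.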
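
The double-backslash point can be checked directly. The short sketch below (the vector x is made up for illustration) shows what the regex engine actually receives, and also shows fixed = TRUE as an alternative that avoids escaping altogether.

```r
pattern <- "c\\("

# The string R stores has three characters: c, one backslash, (
print(nchar(pattern))   # 3
# cat() prints the string as the regex engine will see it.
cat(pattern, "\n")      # c\(

x <- c("c(Introduction", "cross-sectional", "2)")

# Escaped parenthesis: "(" is matched literally.
print(grepl(pattern, x))
# Alternative: fixed = TRUE treats the pattern as plain text, so no escaping is needed.
print(grepl("c(", x, fixed = TRUE))
```

Both grepl() calls flag only the first element, which is why grep("c(", test) without escaping or fixed = TRUE fails with the "Missing ')'" error shown in the question.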