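How do I remove brackets together with the word inside them using the tm package?

Asked: 2015-10-16 11:08:17

Tags: r tm punctuation

Let's say I have a piece of text like this in a document: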

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

I want to remove "(API)", and it needs to be done before this step:

corpus <- tm_map(corpus, removePunctuation) 

After removing "(API)", it should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..."

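I searched for a long time, but I could only find answers about removing the brackets themselves. I also don't want the word inside them to appear in the corpus.

I would really appreciate some hints.

2 Answers:

Answer 0: (score: 1)

If it is just a single word, then how about (untested):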

removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
corpus <- tm_map(corpus, removeBracketed)
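
As a quick check of the pattern outside of tm, here is a minimal base R sketch applied to the sample sentence from the question. The variable names `stripped` and `cleaned` are made up for illustration. It shows that the substitution leaves a double space where "(API)" used to be, so a whitespace cleanup step may be wanted afterwards (tm's `stripWhitespace` transformation is one option inside a corpus pipeline).

```r
# Sample text from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Same pattern as in the answer: a single bracketed word
stripped <- gsub("\\(\\w+\\)", "", txt)

# The spaces on both sides of "(API)" remain, so collapse runs of whitespace
cleaned <- gsub("\\s+", " ", stripped)

print(stripped)
print(cleaned)
```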

Answer 1: (score: 1)

You can use a smarter tokenizer, such as the one in the quanteda package, where removePunct = TRUE will automatically remove the brackets.

quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
##  [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
##  [8] "ingredient"     "API"            "business"       "which"

Added:

If you want to tokenize the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:

# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other"             "segment"           "comprised"         "of"                "our"               "active"           
## [7] "pharmaceutical"    "ingredient"        "business,which..."
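
One caveat with negative indexing via `grep` is worth sketching: when nothing matches, `-grep(...)` yields an empty index and every token is dropped. The base R example below is only an illustration under stated assumptions: it splits on whitespace with `strsplit` as a stand-in for the quanteda tokenizer, and the names `kept`, `no_match`, `dropped_all` and `kept_all` are hypothetical. A logical mask from `grepl` avoids the problem.

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Stand-in for a whitespace tokenizer (assumption: not the quanteda call above)
toks <- strsplit(txt, "\\s+")[[1]]

# Logical mask: TRUE for tokens that are entirely a bracketed expression
is_bracketed <- grepl("^\\(.*\\)$", toks)
kept <- toks[!is_bracketed]

# Pitfall: with no bracketed tokens, -grep() selects nothing at all
no_match <- c("no", "brackets", "here")
dropped_all <- no_match[-grep("^\\(.*\\)$", no_match)]   # empty character vector
kept_all <- no_match[!grepl("^\\(.*\\)$", no_match)]     # all three tokens kept

print(kept)
print(length(dropped_all))
print(kept_all)
```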

If you simply want to remove the parenthetical expression as in the question, then you need neither tm nor quanteda:

# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."

# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API).  New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient.  New sentence..."

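The longer regular expression also catches cases where the parenthetical expression ends a sentence or is followed by other punctuation, such as a comma.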
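
Note that `\\w*` only matches a single run of word characters, so a parenthetical containing spaces would be left in place. A possible variation, which is not part of the answer above, is sketched below: the hypothetical helper `remove_parenthetical` matches anything up to the closing bracket with `[^)]*` and drops the whitespace in front of it. It assumes brackets are not nested and that the parenthetical is preceded by whitespace.

```r
# Hypothetical helper: remove " (...)" including the leading whitespace.
# Assumes no nested brackets; a parenthetical at the very start of the
# string (no preceding whitespace) is left untouched.
remove_parenthetical <- function(x) {
  gsub("\\s\\([^)]*\\)", "", x)
}

txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API).  New sentence..."
txt4 <- "drug (active ingredient) sales"   # made-up multi-word example

print(remove_parenthetical(txt2))
print(remove_parenthetical(txt3))
print(remove_parenthetical(txt4))
```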