Remove only numbers but keep the word "3D" in R?

Time: 2015-12-09 06:39:45

Tags: r tm

I recently started writing text-mining code in R, but I'm having trouble with data preprocessing. I have a string like the following:

"I want to buy 3D printer, but it costs 3000 dollars."

I want to keep the word "3D" but remove "3000", so the result should look like this:

"I want to buy 3D printer, but it costs dollars."

I used corpus <- tm_map(corpus, removeNumbers), but this removes every digit in the text, so I end up with "D printer" in the result when it should be "3D printer".

Is there a way to solve this? Thanks!

2 Answers:

Answer 0: (score: 2)

We can use gsub:

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
gsub('3\\d+\\s', '', str1)

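If this needs to be more general, match any run of digits that starts at a word boundary and is followed by whitespace: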

gsub('\\b\\d+\\s', '', str1)
#[1] "I want to buy 3D printer, but it costs dollars."
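
Note that the pattern above only matches a number that is followed by whitespace, so a number directly before punctuation (for example "3000.") would be left in place. As a rough sketch of one way around that, and of how the same regex could be plugged into a tm pipeline, the helper below uses word boundaries on both sides. The name remove_standalone_numbers is made up for this illustration, and the tm part assumes the tm package is installed and uses its content_transformer() wrapper. Numbers with separators such as "1,000" or "3.5" are not handled.

```r
# Hypothetical helper (not part of tm): drop tokens made up only of digits,
# together with the whitespace in front of them, and keep mixed tokens like "3D".
remove_standalone_numbers <- function(x) {
  # \b on both sides means the digits must form a whole word,
  # so the "3" in "3D" does not match.
  x <- gsub("\\s*\\b\\d+\\b", "", x, perl = TRUE)
  # Remove any whitespace left at the start or end of the string.
  trimws(x)
}

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
print(remove_standalone_numbers(str1))

# Assumption: tm is installed. content_transformer() wraps a plain
# character function so that tm_map() can apply it to each document.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(str1))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_standalone_numbers))
  print(as.character(corpus[[1]]))
}
```

This would replace the removeNumbers step from the question rather than run alongside it.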

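Answer 1: (score: 1)

You could also use a text analysis package such as quanteda, which removes only tokens that consist entirely of digits and leaves digits inside other tokens alone. So in your case: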

require(quanteda)
tokenize("I want to buy 3D printer, but it costs 3000 dollars.", removeNumbers = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "I"       "want"    "to"      "buy"     "3D"      "printer" ","       "but"     "it"      "costs"   "dollars" "."      

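If you want the result returned as a single character object rather than as tokens (although tokenization is probably your goal anyway), then: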

paste(tokenize("I want to buy 3D printer, but it costs 3000 dollars.",
               removeNumbers = TRUE, simplify = TRUE, removeSeparators = FALSE), 
      collapse = "")
## [1] "I want to buy 3D printer, but it costs  dollars."
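
The calls above use the quanteda interface from the time of this answer. In more recent quanteda releases, tokenize() was superseded by tokens(), with snake_case argument names. A minimal sketch under that assumption is shown below; it only runs if quanteda is installed, and the exact printed output may differ between versions, so none is shown here.

```r
text <- "I want to buy 3D printer, but it costs 3000 dollars."

# Assumption: a recent quanteda version where tokens() and the
# remove_numbers argument are available.
if (requireNamespace("quanteda", quietly = TRUE)) {
  # remove_numbers drops tokens that are purely numeric; "3D" is kept.
  toks <- quanteda::tokens(text, remove_numbers = TRUE)
  print(toks)

  # Flatten to a plain character vector and join with single spaces.
  # Punctuation marks are separate tokens, so they end up space-separated.
  cleaned <- paste(as.character(toks), collapse = " ")
  print(cleaned)
}
```

If the original spacing around punctuation matters, the regex approach from the other answer is probably the simpler route.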