I recently wrote some text-mining code in R, but I am having trouble with the preprocessing step. I have a string like this:
"I want to buy 3D printer, but it costs 3000 dollars."
I want to keep the word "3D" but remove "3000", so the result should look like this:
"I want to buy 3D printer, but it costs dollars."
I used corpus <- tm_map(corpus, removeNumbers)
but this removes every digit in the text, so I end up with "D printer" in the result when it should be "3D printer".
Is there a way to solve this? Thanks!
Answer 0: (score: 2)
We can use gsub. For this specific string:
gsub('3\\d+\\s', '', str1)
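If this needs to be general, match any run of digits that starts at a word boundary and is followed by whitespace: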
gsub('\\b\\d+\\s', '', str1)
#[1] "I want to buy 3D printer, but it costs dollars."
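To connect this back to the tm workflow in the question, here is a minimal sketch that wraps a similar regex in a helper and passes it to tm_map in place of removeNumbers. The name remove_standalone_numbers is hypothetical, and the pattern is a slight variant of the one above (a word boundary on both sides, with perl = TRUE), not the answer's exact expression. The tm part only runs if tm is installed.

```r
# Hypothetical helper (not from the original answers): drop tokens made
# up only of digits, plus any whitespace that follows them.
remove_standalone_numbers <- function(x) {
  # \\b...\\b requires a word boundary on both sides, so the "3" in "3D"
  # is left alone because no boundary separates it from "D".
  gsub("\\b\\d+\\b\\s*", "", x, perl = TRUE)
}

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
print(remove_standalone_numbers(str1))
# [1] "I want to buy 3D printer, but it costs dollars."

# Optional: apply the same helper to a tm corpus instead of removeNumbers.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(str1))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_standalone_numbers))
  print(as.character(corpus[[1]]))
}
```

This is only a sketch: decimals such as "3.5" or a number directly before punctuation will leave a stray character or space behind, so the pattern may need tightening for real data.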
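Answer 1: (score: 1)
You can also use a text-analysis package such as quanteda, which removes only standalone numbers, not digits that are part of a word. So in your case: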
require(quanteda)
tokenize("I want to buy 3D printer, but it costs 3000 dollars.", removeNumbers = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "I" "want" "to" "buy" "3D" "printer" "," "but" "it" "costs" "dollars" "."
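If you want the result returned as a single character object rather than as tokens (although tokenization is probably your goal anyway), then: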
paste(tokenize("I want to buy 3D printer, but it costs 3000 dollars.",
               removeNumbers = TRUE, simplify = TRUE, removeSeparators = FALSE),
      collapse = "")
## [1] "I want to buy 3D printer, but it costs dollars."
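The tokenize() call and the removeNumbers argument shown in this answer come from an older quanteda release. To my knowledge, recent releases expose the same behaviour through tokens() with a remove_numbers argument, so a rough equivalent might look like the sketch below. The wrapper name tokenize_without_numbers is hypothetical, and the example only runs if quanteda is installed.

```r
# Hypothetical wrapper (not from the original answer) around quanteda's
# tokens(); remove_numbers = TRUE drops tokens consisting only of digits
# while keeping mixed tokens such as "3D".
tokenize_without_numbers <- function(text) {
  quanteda::tokens(text, remove_numbers = TRUE)
}

if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- tokenize_without_numbers(
    "I want to buy 3D printer, but it costs 3000 dollars."
  )
  # as.character() flattens the tokens object into a plain character vector.
  print(as.character(toks))
}
```

Pasting the tokens back together with a single space would put spaces before punctuation, so if a clean sentence is the goal, the regex approach from the other answer may be the simpler route.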