In the source code of the tm text-mining R package, in the file transform.R, there is the removePunctuation() function, currently defined as:
function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}
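I need to parse and mine some abstracts from a scientific conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, especially at word boundaries. There are the usual ASCII punctuation characters, but also a few Unicode dashes, Unicode quotes, math symbols...
There are also URLs in the text, and in those the intra-word punctuation characters need to be preserved. tm's built-in removePunctuation() function is too aggressive.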
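To illustrate why the built-in is too aggressive for URLs, here is a minimal sketch that reruns the two branches of the definition quoted above as plain gsub() calls. The sample URL is made up for illustration and does not come from the abstracts.

```r
# Made-up sample input; not taken from the abstracts.
x <- "http://example.org/a-b"

# Default branch: every run of punctuation is deleted.
plain <- gsub("[[:punct:]]+", "", x)

# preserve_intra_word_dashes = TRUE branch: dashes between word characters
# are parked as ASCII 1, punctuation is deleted, then the dashes come back.
y <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
y <- gsub("[[:punct:]]+", "", y)
dashes <- gsub("\1", "-", y, fixed = TRUE)

print(plain)   # "httpexampleorgab"
print(dashes)  # "httpexampleorga-b"
```

Either way the URL is no longer usable, because only intra-word dashes are protected.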
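So I need a custom removePunctuation() function that removes punctuation according to my requirements.
My custom Unicode function currently looks like this, but it does not work as expected. I use R only rarely, so getting things done in R takes me some time, even for the simplest tasks.
My function: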
corpus <- tm_map(corpus, rmPunc = function(x){
    # lookbehinds
    # need to be careful to specify fixed-width conditions
    # so that it can be used in lookbehind
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ;
    # lookaheads (can use variable-width conditions)
    x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;
    # remove all strings that consist *only* of punct chars
    gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;
})
It does not work as expected. As far as I can tell, it does nothing at all. The punctuation is still in the term-document matrix, see:
head(Terms(tdm), n=30)
[1] "<></>" "---"
[3] "--," ":</>"
[5] ":()" "/)."
[7] "/++" "/++,"
[9] "..," "..."
[11] "...," "..)"
[13] "“”," "(|)"
[15] "(/)" "(.."
[17] "(..," "()=(|=)."
[19] "()," "()."
[21] "(&)" "++,"
[23] "(0°" "0.001),"
[25] "0.003" "=0.005)"
[27] "0.006" "=0.007)"
[29] "000km" "0.01)"
...
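So my questions are:
Does R support expressions like \P{ASCII} or \P{PUNCT}? I think they are not supported (by default). PCRE: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."
Answer 0 (score: 2)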
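As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects and the metadata is destroyed).
You will get a list back, along with the following error: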
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
Using
tm_map(your_corpus, PlainTextDocument)
will give you your corpus back, but with a broken $meta (in particular, the document IDs will be missing).
Solution
Use content_transformer:
toSpace <- content_transformer(function(x, pattern)
    gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")
Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
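The same content_transformer wrapper can carry a function that strips punctuation only at the edges of each whitespace-separated token, which is roughly what the question asks for. What follows is only a sketch under assumptions: stripEdgePunct and edge_punct are hypothetical names (not part of tm), the extra Unicode characters in the class are an illustrative selection, and the sample sentence is made up. The tm part runs only if the package is installed.

```r
# Hypothetical helper, not part of tm. Character class: ASCII punctuation
# plus a few Unicode marks (right single quote, curly double quotes,
# en dash, em dash, plus-minus). Extend it as the data requires.
edge_punct <- "[[:punct:]\u2019\u201c\u201d\u2013\u2014\u00b1]+"

stripEdgePunct <- function(x) {
  vapply(strsplit(x, "[[:space:]]+"), function(tokens) {
    # Remove punctuation runs at the start or end of each token only,
    # so inner marks (as in URLs or "a-b") survive.
    tokens <- gsub(paste0("^", edge_punct, "|", edge_punct, "$"), "",
                   tokens, perl = TRUE)
    # Drop tokens that consisted of punctuation only.
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

cleaned <- stripEdgePunct("(see http://example.org/a-b), --- ok.")
print(cleaned)  # "see http://example.org/a-b ok"

# Optional: apply it to a corpus without losing the document class.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource("(see http://example.org/a-b), --- ok."))
  corpus <- tm::tm_map(corpus, tm::content_transformer(stripEdgePunct))
  print(as.character(corpus[[1]]))
}
```

Splitting on whitespace first avoids the fixed-width lookbehind problem from the question, at the cost of collapsing runs of whitespace into single spaces.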
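This function removes everything that is not alphanumeric or whitespace (e.g. UTF-8 emoticons and the like):
removeNonAlnum <- function(x){
    # A second "^" inside the brackets would be a literal caret, so it is left out.
    gsub("[^[:alnum:][:space:]]", "", x)
}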
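On the question of \p{...} classes: with perl = TRUE, R hands the pattern to PCRE, and Unicode properties work only if that PCRE build includes Unicode property support. The sketch below assumes it does; removeNonAlnumUnicode is a hypothetical name, and pcre_config() is there to check the local build before relying on it.

```r
# Reports how the PCRE library used by R was built, including whether
# Unicode properties are available.
print(pcre_config())

# Hypothetical variant of removeNonAlnum using Unicode properties:
# \p{L} = letters, \p{N} = numbers, \s = whitespace.
removeNonAlnumUnicode <- function(x) {
  gsub("[^\\p{L}\\p{N}\\s]", "", x, perl = TRUE)
}

print(removeNonAlnumUnicode("0.001), a^b"))  # "0001 ab"
```

Note that \p{P} on its own covers punctuation only. Characters such as +, <, = and ± belong to the symbol category \p{S}, so a class meant to catch both needs [\p{P}\p{S}].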
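Answer 1 (score: 1)
I had the same problem: my custom function was not working. It turned out that the first line below has to be added.
Regards,
Susana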
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
    x <- gsub(".", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(",", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(":", " ", x, ignore.case =FALSE, fixed = TRUE)
    return(x)
}
notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)