Remove words longer than X characters in R

Asked: 2015-09-21 13:57:17

Tags: regex r gsub corpus

After removing punctuation, digits, and non-ASCII characters from my text in R, I am left with a lot of very long words:

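library(qdapRegex)  # assumed source of rm_white()

ques1 <- gsub("[[:digit:]]", " ", ques1, perl = TRUE)  # digits -> spaces
ques1 <- gsub("[[:punct:]]", " ", ques1, perl = TRUE)  # punctuation -> spaces
ques1 <- iconv(ques1, "latin1", "ASCII", sub = " ")    # non-ASCII bytes -> spaces
ques1 <- rm_white(ques1)                               # collapse extra whitespace
ques1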

I checked the length of the longest word, which is 35 characters:
max(nchar(strsplit(ques1, " ")[[1]]))
[1] 35

Now I want to remove every word longer than 10 characters, because I don't need them. For example:

wwwhotmailcomlearnbyexample

Any help would be appreciated!

1 answer:

Answer 0: (score: 4)

Use the following gsub call:

ques1 <- "A long sentence with long wwwhotmailcomlearnbyexample"
gsub("\\b[[:alpha:]]{11,}\\b", "", ques1, perl = TRUE)

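The regex \\b[[:alpha:]]{11,}\\b matches any word made of 11 or more letters: \\b is a word boundary (the backslash is doubled because it sits inside an R string), [:alpha:] stands for any letter, and {11,} means "11 or more repetitions".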
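If the length limit needs to vary, the same pattern can be built from a parameter. The following is only a sketch: the function name remove_long_words and its max_len argument are made up for illustration and are not part of the original answer. It also collapses the spaces left behind by the removed words, using base R only.

```r
# Hypothetical helper (not from the original answer): remove alphabetic
# words longer than max_len characters from a character vector.
remove_long_words <- function(x, max_len = 10) {
  # Build \b[[:alpha:]]{N,}\b with N = max_len + 1
  pattern <- sprintf("\\b[[:alpha:]]{%d,}\\b", max_len + 1)
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace left by the removals, then trim both ends
  trimws(gsub("[[:space:]]+", " ", out))
}

ques1 <- "A long sentence with long wwwhotmailcomlearnbyexample"
remove_long_words(ques1)
# [1] "A long sentence with long"
```

Because gsub is vectorised, the helper should also work on a vector of documents, not just a single string.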

See the IDEONE demo.
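A regex-free alternative is to reuse the strsplit/nchar check from the question and keep only the short words. This is a sketch under the assumption that the text is a single string whose words are separated by single spaces, as it is after the cleaning steps in the question; drop_long_words is a hypothetical name. Unlike the [[:alpha:]] pattern, it counts every character in a word, not only letters.

```r
# Hypothetical helper: split one string on spaces and keep only the
# words that have at most max_len characters.
drop_long_words <- function(x, max_len = 10) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(words[nchar(words) <= max_len], collapse = " ")
}

ques1 <- "A long sentence with long wwwhotmailcomlearnbyexample"
drop_long_words(ques1)
# [1] "A long sentence with long"
```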