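I want to remove punctuation, numbers and http links from the text in a data.frame. I tried the tm, stringr, quanteda and tidytext packages, but none of them worked. I am looking for a basic package or function that can clean a data.frame without converting it to a corpus or something similar.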
mycorpus <- tm_map(mycorpus, content_transformer(remove_url))
Warning message:
In tm_map.SimpleCorpus(mycorpus, content_transformer(remove_url)) :
  transformation drops documents
mycorpus <- tm_map(mycorpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(mycorpus, removePunctuation) :
  transformation drops documents
Also, when I try to view some of the tweets that contain any symbols:
Error in nchar(output) : invalid multibyte string, element 1
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input
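Answer 0 (score: 2)
Since you have not posted any sample input or expected output, I could not test this. To remove punctuation, digits and http links from a specific column of a data frame, you could try the following.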
gsub("[[:punct:]]|[[:digit:]]|^http:\\/\\/.*|^https:\\/\\/.*","",df$column)
As Rui suggested in the comments, the following can also be used.
gsub("[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)","",df$column)
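Note that the first pattern only matches a link at the very start of a string (because of the ^ anchor) and then removes everything up to the end of that string, while the second pattern removes just the http:// prefix and leaves the rest of the address behind as letters. If the goal is to drop whole links wherever they occur, a minimal base R sketch could look like the one below. The helper name clean_text and the sample data are made up for illustration, and the sketch assumes that links start with http:// or https:// and contain no whitespace.

```r
# Hypothetical helper (not part of any package): removes links, punctuation and digits
clean_text <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop whole links first
  x <- gsub("[[:punct:]]+", " ", x)                # then punctuation
  x <- gsub("[[:digit:]]+", " ", x)                # then digits
  x <- gsub("\\s+", " ", x, perl = TRUE)           # squeeze runs of whitespace
  trimws(x)                                        # strip leading/trailing spaces
}

# Made-up sample data; stringsAsFactors = FALSE keeps the text as character
df <- data.frame(
  id = 1:3,
  text = c("Check this out: https://example.com/a?b=1 !!!",
           "R 4.0 is great, see http://example.org",
           "No links here, just 3 numbers: 1, 2, 3."),
  stringsAsFactors = FALSE
)

# Clean one column in place, no corpus conversion needed
df$text <- clean_text(df$text)
print(df$text)
```

Removing the links before the punctuation matters: once the slashes and dots are gone, a link can no longer be told apart from ordinary words.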
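Answer 1 (score: 0)
If your aim is to keep only letters by replacing every non-letter character, a more concise version is possible. I am also guessing that you want to replace those characters with a space, since you mentioned something about a corpus. Otherwise each URL would collapse into one long string (but maybe that is what you want; as mentioned, you could provide an example).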
x = c("https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"
, "http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r")
gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)
# [1] " stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r"
# [2] " stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r"
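Because every removed character becomes its own space, the raw result contains runs of consecutive spaces (for example where the numeric question id used to be) and leading spaces. If single spaces are wanted, one possible follow-up step is sketched below; the variable names raw and tidy are made up for illustration.

```r
x <- "https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"

# Same replacement as above: each non-word character or digit turns into one space
raw <- gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)

# Collapse runs of spaces into one, then strip leading/trailing whitespace
tidy <- trimws(gsub(" +", " ", raw))
print(tidy)
```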
The same task for a data frame with 100,000 rows:
# make sure that your strings are not factors
df = data.frame(id = 1:1e5, url = rep(x, 1e5/2), stringsAsFactors = FALSE)
# df before replacement
df[1:4, ]
# id url
# 1 1 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 2 2 http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 3 3 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 4 4 http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# apply replacement on a specific column and assign result back to this column
df$url = gsub("\\W|\\d|http\\w?", " ", df$url, perl = TRUE)
# check output
df[1:4, ]
# id url
# 1 1 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 2 2 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 3 3 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 4 4 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
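The "invalid multibyte string" and "invalid input" errors in the question suggest (though the question does not confirm it) that some tweets contain bytes that are not valid in the session's encoding. In that case it may help to repair the encoding before running any gsub call. The sketch below uses made-up data and assumes a UTF-8 session; option 2 additionally assumes the text was originally Latin-1, which may not hold for your data.

```r
# Made-up example: a raw Latin-1 byte (\xe9) inside otherwise plain ASCII text
tweets <- c("caf\xe9 opens at 9!", "plain ascii text")

# validUTF8() flags the elements that contain invalid UTF-8 byte sequences
print(validUTF8(tweets))

# Option 1: drop every byte that is not valid UTF-8
dropped <- iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")

# Option 2 (assumes the source really is Latin-1): convert instead of dropping
converted <- iconv(tweets, from = "latin1", to = "UTF-8")

# Both results can now be passed to nchar(), tolower() or gsub()
print(nchar(dropped))
print(nchar(converted))
```

Option 1 loses the offending characters, while option 2 keeps them but only gives sensible text if the guess about the original encoding is right.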