Question

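I am currently trying to build a sentence-based LDA on a text corpus I am working with. To detect sentences and split them, I use the sent_detect() function from the openNLP package.

However, the dataset I am working with is quite messy and contains a lot of "punctuation" that I would like to get rid of before calling sent_detect().

Normally, I would use the following code (from the tm package) on the text corpus to remove punctuation: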
text.corpus <- tm_map(text.corpus, removePunctuation)

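However, this function removes all punctuation, including the ".", "?", "!" and "|" that sent_detect() uses to detect sentences. That defeats the purpose of splitting the text into separate sentences.

Is there a way to remove punctuation as the tm_map() call above does, but exclude the specific "sentence indicators" (".", "?", "!", "|")?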

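Here is a text example:

not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!

Normally, the tm_map() call above removes all punctuation and leaves the following:

not funny i did not like the movie film at all since the actors were terrible however i really enjoyed the scenery

What I would like to end up with, however, is:

not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!

Thanks!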

PS: Using the openNLP package is not a must; I am open to any other solution as well!

Answer 1

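You can use gsub: define all the characters you want to remove as a pattern, join them with the alternation marker |, make sure that special characters such as ( and ) are properly escaped with \\, and replace the pattern with "", i.e. with nothing in the replacement argument:

gsub(";|- |/ |,|\\(|\\)", "", s)
[1] "not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!"

Data:

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"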
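Listing every unwanted character by hand can get tedious on messy data. One possible generalization, offered as a sketch rather than as part of the answer above, is to match the whole [[:punct:]] class and use a negative lookahead (which requires perl = TRUE) to skip the sentence markers. The variable names keep and pattern are only illustrative.

```r
s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

# Sentence markers to keep; inside a character class they need no escaping
keep <- ".?!|"

# Match any punctuation character that is NOT one of the markers in `keep`
pattern <- paste0("(?![", keep, "])[[:punct:]]")

cleaned <- gsub(pattern, "", s, perl = TRUE)

# Deleting "; -" and "/" leaves runs of spaces behind; collapse them to one
cleaned <- gsub(" +", " ", cleaned)

print(cleaned)
```

Changing keep is enough to keep a different set of markers, as long as the characters are valid inside a regex character class.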

Answer 2

Using stringr and a double-negation ("not not") character class (thanks to Chris Ruehlemann for the comment):

library(stringr)

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

str_remove_all(s, "[^[^[[:punct:]]]!|.|?]")
[1] "not funny  i did not like the movie  film at all since the actors were terrible. however i really enjoyed the scenery!"
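Note that the output above still contains double spaces where "; -" and "/" were removed. Since the question starts from a tm corpus, a similar "keep everything except..." idea can probably be applied through tm_map() by wrapping a custom function in content_transformer(). The following is a minimal base R sketch under that assumption: remove_inner_punct is a hypothetical helper name, the negated class is a base R stand-in for the stringr pattern, and the corpus part only runs if tm is installed.

```r
# Hypothetical helper: drop every character that is not a letter, a digit,
# whitespace, or one of the sentence markers . ? ! |
remove_inner_punct <- function(x) {
  x <- gsub("[^[:alnum:][:space:].?!|]", "", x)
  # Collapse the gaps left behind by the removed characters
  gsub("[[:space:]]+", " ", x)
}

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"
print(remove_inner_punct(s))

# Applying the helper to a tm corpus (skipped when tm is not installed)
if (requireNamespace("tm", quietly = TRUE)) {
  text.corpus <- tm::VCorpus(tm::VectorSource(s))
  text.corpus <- tm::tm_map(text.corpus, tm::content_transformer(remove_inner_punct))
  print(as.character(text.corpus[[1]]))
}
```

Unlike a pattern built on [[:punct:]], this negated class also drops any other character that is not alphanumeric or whitespace, which may or may not be desirable for a given corpus.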

Remove punctuation in R but keep the punctuation/"sentence markers" "!", ".", "?" at the end of sentences
