Is punctuation part of a bag of words?

Time: 2013-09-04 05:13:40

Tags: java regex nlp text-processing

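I'm building a bag-of-words module from scratch, and I'm not sure whether removing punctuation is best practice in this approach. Consider the sentence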

I've been "DMX world center" for long time ago.Are u?

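Question: for the bag of words, should I consider

  • the token DMX (without the quote) or "DMX (including the opening quote)
  • u (without the question mark) or u? (with the question mark)

In short, should I remove all punctuation when collecting the distinct words?

Thanks in advance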

Update: this is the code I implemented

Sample text: ham , im .. On the snowboarding trip. I was wondering if your planning to get everyone together befor we go..a meet and greet kind of affair? Cheers,

   HashSet<String> bagOfWords = new HashSet<String>();
   BufferedReader reader = new BufferedReader(new FileReader(path));
   while (reader.ready()) {
       String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // I keep only the 2nd part. The 1st part indicates whether the message is spam or ham
       String[] words = msg.split("[\\s+\n.\t!?+,]"); // this is the regex that I've used to split words
       for (String word : words) {
           bagOfWords.add(word);
       }
   }
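
For illustration, here is a minimal sketch of what that split produces on a shortened piece of the sample text and on the example sentence. The class name `SplitDemo` and the helper `split` are hypothetical and not part of the code above. Inside a character class, `+` is a literal character rather than a quantifier, so every single listed character acts as a delimiter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SplitDemo {
    // Same character class as in the question: each listed character is a
    // delimiter on its own, so adjacent delimiters yield empty strings.
    static List<String> split(String msg) {
        return Arrays.asList(msg.toLowerCase(Locale.ROOT).split("[\\s+\n.\t!?+,]"));
    }

    public static void main(String[] args) {
        // Runs of delimiters such as " , " and " .. " leave empty tokens behind.
        System.out.println(split("ham , im .. On the trip"));
        // Quotes are not in the character class, so they stay attached to the words.
        System.out.println(split("I've been \"DMX world center\""));
    }
}
```

With this split, the set can end up containing the empty string as well as tokens such as `"dmx` that still carry a quote.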

1 answer:

Answer 0: (score: 2)

Try replacing your code with

   while (reader.ready()) {
       String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // keep only the 2nd part; the 1st part indicates whether the message is spam or ham
       String[] words = msg.split("[\\s+\n.\t!?+,]"); // the regex used to split words
       for (String word : words) {
           // Strip the punctuation you mentioned. The hyphen is escaped so that "!-+" is not
           // read as a character range, which would also match #, $, %, &, ', (, ) and *.
           bagOfWords.add(word.replaceAll("[!\\-+.^:,\"?]", "").trim());
       }
   }
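
As an alternative to splitting and then cleaning each piece, the tokens can be matched directly. The following is only a sketch under one assumed convention, not a definitive answer to what a bag of words must contain: a token is a run of letters or digits, optionally joined by inner apostrophes. The names `BagOfWordsSketch`, `TOKEN`, and `bagOfWords` are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BagOfWordsSketch {
    // Assumed token definition: letters/digits, with apostrophes allowed
    // only between such runs (so "i've" stays one token).
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'[\\p{L}\\p{N}]+)*");

    static Set<String> bagOfWords(String msg) {
        Set<String> bag = new LinkedHashSet<>();
        Matcher m = TOKEN.matcher(msg.toLowerCase(Locale.ROOT));
        while (m.find()) {
            bag.add(m.group()); // only non-empty matches are ever added
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(
                bagOfWords("I've been \"DMX world center\" for long time ago.Are u?"));
    }
}
```

Under this convention the example sentence yields `dmx` without the quote and `u` without the question mark, and `ago.Are` is separated into `ago` and `are` even though there is no space after the period.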