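What exactly should we do in the preprocessing step for GPT2? Are there any guidelines?
Is the following a good set of preprocessing steps?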
1. Remove any \n from the sentence
2. Remove extra spaces from the sentence
3. Leave everything else that is part of the sentence but is not strictly a word (e.g. URLs, non-English words that may appear in an English sentence, emojis, etc.)
Would it be better to also remove extra punctuation or any non-English characters?
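
To make the three steps concrete, here is a minimal sketch of what I have in mind, using only Python's standard `re` module. The function name `preprocess_for_gpt2` is just a placeholder of mine, and replacing each newline with a space (rather than deleting it outright) is my own assumption, so that words on adjacent lines do not get glued together.

```python
import re


def preprocess_for_gpt2(text: str) -> str:
    """Hypothetical helper: apply steps 1 and 2 and leave everything else alone."""
    # Step 1: turn newlines into a single space (assumption: a space, not a
    # plain deletion, so "foo\nbar" becomes "foo bar" and not "foobar").
    text = re.sub(r"[\r\n]+", " ", text)
    # Step 2: collapse runs of spaces or tabs into one space and trim the ends.
    text = re.sub(r"[ \t]+", " ", text).strip()
    # Step 3: nothing else is removed, so URLs, emojis and non-English words
    # pass through unchanged.
    return text


sample = "Check  this out:\nhttps://example.com   😀 très bien"
print(preprocess_for_gpt2(sample))
# prints: Check this out: https://example.com 😀 très bien
```

The cleaned string would then be passed to the tokenizer as-is; this sketch does not strip punctuation or non-English characters, which is exactly the part I am unsure about.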