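I have an NLP classification problem where each observation, a list of words and phrases, has to be assigned to one of two classes. The features are the words and phrases themselves, and some of them occur more than once within a single observation. For example: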
| Observation | label |
|:--------------------------------------------------:|:------:|
| 'dog, jump, eat, drink water, jump' | animal |
| 'run, jump, travel, sleep, talk, grocery shopping' | human |
| 'swim, language, jump, eat, go to school' | human |
| 'bite, lick, growl, eat, run, lick, scratch' | animal |
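
I have tried fully connected (affine) DNNs and ConvNets on vectorized input text (CountVectorizer and TfidfVectorizer), as well as on tokenized input + word embeddings, both with and without regularization. Validation accuracy never seems to get above 76%. Any suggestions on how to improve the model's performance?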
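
For reference, the count-based featurization looks roughly like the sketch below. It assumes scikit-learn and uses only the four example rows from the table above. The `split_on_commas` helper is a hypothetical name introduced for illustration; it is there so that multi-word phrases such as "drink water" stay a single feature rather than being split on whitespace, which may or may not match every setup.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The four example observations from the table above (toy data only).
observations = [
    "dog, jump, eat, drink water, jump",
    "run, jump, travel, sleep, talk, grocery shopping",
    "swim, language, jump, eat, go to school",
    "bite, lick, growl, eat, run, lick, scratch",
]
labels = ["animal", "human", "human", "animal"]


def split_on_commas(text):
    """Hypothetical helper: treat each comma-separated item as one token."""
    return [item.strip() for item in text.split(",") if item.strip()]


# token_pattern=None because a custom tokenizer replaces the default regex.
vectorizer = CountVectorizer(tokenizer=split_on_commas, token_pattern=None)
X = vectorizer.fit_transform(observations)  # sparse matrix of raw counts
vocab = vectorizer.vocabulary_              # maps phrase -> column index

# Repeated items show up as counts greater than 1.
print(X.shape)
print(X[0, vocab["jump"]])
```

Swapping `CountVectorizer` for `TfidfVectorizer` with the same `tokenizer` argument gives the TF-IDF variant of the same features.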