I am running a CART decision tree on a training set that I tokenized with quanteda for a routine text analysis task. I converted the resulting DFM to a data frame and appended the class attribute I am predicting.
Like many DFMs, the table is very wide (33k columns) but contains only about 5,500 rows (documents). Calling rpart on my training set returns a stack overflow error.
If it matters, to help increase the speed of calculations, I am using the doSNOW library so I can run the model on 3 out of 4 of my cores in parallel.
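I've looked at this answer, but I can't figure out how to do the equivalent on a Mac workstation to see whether the same solution would work for me. It is also possible that I would still hit this error even after increasing RStudio's ppsize.
So my question is: how do I increase the maximum ppsize for RStudio on a Mac or, more generally, how do I fix this stack overflow so that I can run the model?
Thanks!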
Answer 0 (score: 0)
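In the end, I found that the Mac has no equivalent command-line option, because the Mac version of RStudio uses all available memory by default.
So I solved the problem by lowering the complexity of the task, which meant reducing sparsity. I cleaned the document-term matrix by removing every token that did not appear in at least 5% of the corpus. That was enough to shrink the matrix from 33k columns to a much more manageable 3k columns while still keeping a highly representative DFM.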
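For anyone who wants to try the same thing, here is a minimal sketch of that trimming step. I am assuming quanteda's `dfm_trim()` with a proportional document-frequency threshold, since that is one way to apply a 5% cutoff. The toy corpus and the names `docs`, `dfm_full`, and `dfm_small` are made up for illustration.

```r
library(quanteda)

# Hypothetical toy corpus of 40 short documents:
# "common" occurs in every document, "medium" in 4 of them (10%),
# and each "rareN" token occurs in exactly one document (2.5%).
docs <- paste0("common rare", 1:40, ifelse(1:40 <= 4, " medium", ""))

# Build the full document-feature matrix.
dfm_full <- dfm(tokens(docs))

# Keep only features that appear in at least 5% of the documents.
# docfreq_type = "prop" makes min_docfreq a proportion rather than a count.
dfm_small <- dfm_trim(dfm_full, min_docfreq = 0.05, docfreq_type = "prop")

# Compare the number of columns before and after trimming.
nfeat(dfm_full)
nfeat(dfm_small)
featnames(dfm_small)
```

On a real corpus, the trimmed DFM can then be converted to a data frame and passed to rpart as before. The 5% cutoff is a judgment call, so it is worth checking how many features survive at a few different thresholds.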