StanfordNLP classifier out-of-memory error

Asked: 2018-10-24 22:27:41

Tags: java stanford-nlp

I'm using StanfordNLP to classify some text. It works fine with a training file of up to 160,000 lines. With anything larger, however, I get java.lang.OutOfMemoryError: Java heap space.

I'm using the following properties:

e.s.n.c.ColumnDataClassifier - Setting ColumnDataClassifier properties
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordTriples = true
e.s.n.c.ColumnDataClassifier - useQN = true
e.s.n.c.ColumnDataClassifier - encoding = utf-8
e.s.n.c.ColumnDataClassifier - useClassFeature = true
e.s.n.c.ColumnDataClassifier - 1.binnedLengths = 10,20,30
e.s.n.c.ColumnDataClassifier - 1.minNGramLeng = 2
e.s.n.c.ColumnDataClassifier - lowercase = true
e.s.n.c.ColumnDataClassifier - intern = true
e.s.n.c.ColumnDataClassifier - 1.splitWordsRegexp = \s+
e.s.n.c.ColumnDataClassifier - goldAnswerColumn = 0
e.s.n.c.ColumnDataClassifier - 1.minWordNGramLeng = 2
e.s.n.c.ColumnDataClassifier - displayedColumn = 1
e.s.n.c.ColumnDataClassifier - printClassifierParam = 200
e.s.n.c.ColumnDataClassifier - 1.useNGrams = true
e.s.n.c.ColumnDataClassifier - QNsize = 5
e.s.n.c.ColumnDataClassifier - sigma = 3
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordPairs = true
e.s.n.c.ColumnDataClassifier - tolerance = 1e-4
e.s.n.c.ColumnDataClassifier - 1.usePrefixSuffixNGrams = true
e.s.n.c.ColumnDataClassifier - 1.useSplitWordNGrams = true
e.s.n.c.ColumnDataClassifier - 1.maxWordNGramLeng = 4
e.s.n.c.ColumnDataClassifier - 1.maxNGramLeng = 4

Training file details

e.s.n.c.Dataset - numDatums: 231049

numDatumsPerLabel: {84146000=1654.0, 84610000=76.0, 85164000=1991.0, 85171232=25.0, 94010000=4534.0, 85171231=32257.0, 85166000=224.0, 94031000=51.0, 84181000=5607.0, 85094050=456.0, 94035000=2530.0, 84184000=586.0, 84183000=466.0, 85094020=1502.0, 85161000=375.0, 85270000=2.0, 84151000=823.0, 85163100=1977.0, 85163200=1858.0, 84430000=1803.0, 85167920=597.0, 73211100=4963.0, 84145000=3369.0, 85171100=297.0, 84500000=1919.0, 85165000=1136.0, 99999999=123959.0, 94032000=184.0, 94030000=44.0, 85091000=1466.0, 85098000=85.0, 94034000=837.0, 94036000=2066.0, 85094010=2826.0, 85287200=10090.0, 84243010=945.0, 84186900=427.0, 85183000=1130.0, 84713010=11690.0, 84715010=1633.0, 94041000=1783.0, 85167910=806.0}
numLabels: 42 [99999999, 73211100, 84145000, 84146000, 84151000, 84181000, 84183000, 84184000, 84186900, 84243010, 84430000, 84500000, 84610000, 84713010, 84715010, 85091000, 85094010, 85094020, 85094050, 85098000, 85161000, 85163100, 85163200, 85164000, 85165000, 85166000, 85167910, 85167920, 85171100, 85171231, 85171232, 85183000, 85270000, 85287200, 94010000, 94030000, 94031000, 94032000, 94034000, 94035000, 94036000, 94041000]
numFeatures (Phi(X) types): 9434620 [CLASS, 1-SW#-fulano-firmar, 1-#-oli, 1-#-irma, 1-#B-rob, ...]
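A rough back-of-envelope estimate (my own math, assuming the linear classifier stores one double weight per (feature, label) pair) suggests why this blows up:

```java
public class WeightMemoryEstimate {
    public static void main(String[] args) {
        long numFeatures = 9_434_620L; // "numFeatures (Phi(X) types)" from the Dataset log
        long numLabels = 42L;          // "numLabels" from the Dataset log
        long bytesPerDouble = 8L;

        // One dense weight matrix: one double per (feature, label) pair.
        long weightBytes = numFeatures * numLabels * bytesPerDouble;

        System.out.printf("weight matrix: %d bytes (~%.2f GiB)%n",
                weightBytes, weightBytes / (1024.0 * 1024 * 1024));
        // prints roughly 3.17e9 bytes (~2.95 GiB)
    }
}
```

And if I understand L-BFGS correctly, QNMinimizer additionally keeps the current point, the gradient, and on the order of 2 × QNsize history vectors, each the same length as the weight vector, so even a 10 GB heap can be exhausted.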

The exception is:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:891)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:856)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:850)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:93)
        at edu.stanford.nlp.classify.LinearClassifierFactory.trainWeights(LinearClassifierFactory.java:529)
        at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:929)
        at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:913)
        at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1482)
        at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2087)
        at com.firmar.TextClassifier.<init>(TextClassifier.java:75)
        at com.firmar.App.main(App.java:27)

Line 75 of TextClassifier is the line in my code (cdc.trainClassifier(trainFile)) where I train the ColumnDataClassifier as follows:

ColumnDataClassifier cdc = new ColumnDataClassifier(propFile);
cdc.trainClassifier(trainFile);

App is just the command-line program I wrote to run the text classifier. I invoke it like this:

java -Xmx10240m -jar textclassifier-1.0-jar-with-dependencies.jar ./stanford_classifier.prop ./stanford_classifier.train

So, as you can see, I'm giving the application 10 GB to run with (my server has 12 GB).

Since the exception is thrown in QNMinimizer, I tried reducing QNsize to 5 (the default is 15), but the same error occurs. Is there any parameter I can change to reduce memory usage, or do I need to put more memory in the server?
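One thing I'm considering trying next is pruning the feature space, since most of the 9.4M features presumably come from the word-pair/triple and character n-gram options. A sketch of reduced settings (property names as I understand them from the ColumnDataClassifier documentation, untested on this data):

```properties
# Drop the most explosive feature generators
1.useAllSplitWordTriples = false
1.useAllSplitWordPairs = false
# Narrow the n-gram ranges
1.maxNGramLeng = 3
1.maxWordNGramLeng = 2
# Discard features seen fewer than this many times in training
featureMinimumSupport = 5
```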

Update: I added more memory (the server now has 16 GB and the application runs with 14 GB), and I also disabled QN (useQN = false). Same error...

0 Answers:

No answers yet