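I am trying to read about 400 files from a corpus one at a time and split each file into words. Each word is then mapped to a (key, value) pair of the form ((fileName, word), 1), but I get a java.lang.StackOverflowError. I am using Java with Apache Spark.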
// lines, words, finalCorpus, fileName and sc are declared earlier (not shown)
for (int i = 1; i <= 400; i++) {
    String fileNames = fileName + "/New" + i + ".txt";
    lines = sc.textFile(fileNames);
    // Split each line on single spaces to get the words of this file
    words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
    // Map every word to ((fileName, word), 1)
    JavaPairRDD<Tuple2<String, String>, Integer> file = words.mapToPair(new PairFunction<String, Tuple2<String, String>, Integer>() {
        @Override
        public Tuple2<Tuple2<String, String>, Integer> call(String s) throws Exception {
            return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String, String>((fileNames), s), 1);
        }
    });
    // Each iteration wraps the previous result in one more union
    if (finalCorpus != null) {
        finalCorpus = finalCorpus.union(file);
    } else {
        finalCorpus = file;
    }
}
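What is the best solution? I have 5 GB of memory available, and this error does not seem reasonable for this number of files. The stack trace is as follows: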
Exception in thread "main" java.lang.StackOverflowError
at java.io.ObjectStreamClass$FieldReflector.getPrimFieldValues(ObjectStreamClass.java:2002)
at java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1277)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
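
The repeating ObjectOutputStream frames in the trace suggest that Java serialization is recursing through a deeply nested object graph, which is plausibly the lineage built by chaining one union per file. That would make this a stack-depth problem rather than a heap problem, so the 5 GB of memory would not help. The sketch below is not Spark code: it uses only the JDK, and Node, nestedChain, flatList and serializes are hypothetical names chosen for illustration. It compares serializing a chain of objects that each hold a reference to the previous one against a flat list of the same number of objects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class LineageDepthDemo {

    /** Hypothetical stand-in for an object that keeps a reference to its parent. */
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final Node parent;

        Node(int id, Node parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    /** Builds a chain in which every node wraps the previous one (assumed analogy to repeated union calls). */
    static Node nestedChain(int depth) {
        Node head = null;
        for (int i = 0; i < depth; i++) {
            head = new Node(i, head);
        }
        return head;
    }

    /** Builds the same number of nodes, but with no links between them. */
    static List<Node> flatList(int size) {
        List<Node> nodes = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            nodes.add(new Node(i, null));
        }
        return nodes;
    }

    /** Returns true if Java serialization completes, false if it throws StackOverflowError. */
    static boolean serializes(Object graph) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(graph);
            out.flush();
            return true;
        } catch (StackOverflowError e) {
            // Default serialization recursed once per nesting level and ran out of stack
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int n = 200_000;
        System.out.println("flat list serializes:    " + serializes(flatList(n)));
        System.out.println("nested chain serializes: " + serializes(nestedChain(n)));
    }
}
```

With a default-sized thread stack, the nested chain is expected to fail while the flat list succeeds, although the exact depth at which the overflow occurs depends on the JVM and its stack size setting. If the same reading applies to the loop above, keeping the lineage flat should help: for example, reading the whole directory in one call (JavaSparkContext.wholeTextFiles returns (path, content) pairs), or collecting the per-file RDDs and combining them in a single JavaSparkContext.union call instead of pairwise.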