I want to use Spark to process a large text file. The order of the words in the text matters and must be preserved.
I tried the approach below, but it fails on large texts with an out-of-memory error. (Obviously, since the whole file ends up as a single record!)
hadoopConf.set("textinputformat.record.delimiter", "$$$$$"); // a delimiter that never occurs in the input, so the whole file is read as one record
JavaRDD<String> texts = sparkContext
        .newAPIHadoopFile(inputFile, TextInputFormat.class, LongWritable.class, Text.class, hadoopConf)
        .values()
        .map(x -> x.toString());
JavaRDD<Tuple2<String, Integer>> lines = texts.flatMap(new ReadDocuments());
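For completeness, sparkContext and hadoopConf above come from the usual setup, and these are the imports the snippet relies on (a minimal sketch; the app name is just a placeholder). Note that newAPIHadoopFile expects the new-API TextInputFormat from org.apache.hadoop.mapreduce.lib.input, not the old mapred one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("OrderedTextProcessing"); // placeholder app name
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Configuration hadoopConf = new Configuration();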
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;

import scala.Tuple2;

public class ReadDocuments implements FlatMapFunction<String, Tuple2<String, Integer>> {

    private static final long serialVersionUID = 1L;

    // Splits the whole text into lines and pairs every non-empty line with
    // its index in the original text, so the order can be restored later.
    @Override
    public Iterator<Tuple2<String, Integer>> call(String text) throws Exception {
        List<Tuple2<String, Integer>> lines = new ArrayList<>();
        String[] tempLines = text.split("\n");
        for (int i = 0; i < tempLines.length; i++) {
            if (tempLines[i].length() > 0) {
                lines.add(new Tuple2<>(tempLines[i], i));
            }
        }
        return lines.iterator();
    }
}
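What I'm really after is the same (line, index) pairing without ever holding the whole file as one string. I suspect something along these lines would stream the file line by line instead, but I'm not sure it's the right approach (a sketch; note that zipWithIndex yields Long indices rather than Integer, and rawLines/indexed are names I made up here):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Read line by line: each partition streams its own split,
// so no single record has to fit in memory
JavaRDD<String> rawLines = sparkContext.textFile(inputFile);

// zipWithIndex pairs each line with its global position in the file
JavaPairRDD<String, Long> indexed = rawLines
        .zipWithIndex()
        .filter(t -> t._1.length() > 0); // drop empty lines, keeping their original indices

And to show why I keep the index at all: the plan is to restore the original order at the end by sorting on it, roughly like this (continuing the sketch above):

// After whatever per-line processing, sort by the saved index to recover the order
JavaPairRDD<Long, String> byIndex = indexed.mapToPair(t -> new Tuple2<>(t._2, t._1));
JavaRDD<String> inOrder = byIndex.sortByKey().values();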
I also tried reading the file with sparkContext.wholeTextFiles(inputFile), but since that also materializes each file as a single in-memory string, it hit the same problem!
Any ideas or hints would be much appreciated.