I have this code, which reads a corpus of documents and stores them in a Java data structure.
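    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Each entry of corpus is one sentence, stored as a list of word ids.
    public List<List<Integer>> corpus;

    corpus = new ArrayList<List<Integer>>();
    numDocuments = 0;
    numWordsInCorpus = 0;
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader br = new BufferedReader(new FileReader(pathToCorpus))) {
        int indexSentence = -1;
        int indexWord = -1;
        // One line of the file is one document.
        for (String doc; (doc = br.readLine()) != null;) {
            if (doc.trim().length() == 0)
                continue;
            List<List<Integer>> document = new ArrayList<List<Integer>>();
            // Sentences within a document are separated by tabs.
            String[] sentenceStrs = doc.split("\t");
            for (String sentenceStr : sentenceStrs) {
                List<Integer> sentence = new ArrayList<Integer>();
                indexSentence += 1;
                String[] words = sentenceStr.trim().split(" ");
                for (String word : words) {
                    if (word2IdVocabulary.containsKey(word)) {
                        // Known word: reuse its id.
                        sentence.add(word2IdVocabulary.get(word));
                    }
                    else {
                        // New word: assign the next free id.
                        indexWord += 1;
                        word2IdVocabulary.put(word, indexWord);
                        id2WordVocabulary.put(indexWord, word);
                        sentence.add(indexWord);
                    }
                }
                document.add(sentence);
                numWordsInCorpus += sentence.size();
            }
            numDocuments++;
            // Note: addAll appends each sentence separately, so corpus ends up
            // with one entry per sentence, not one entry per document.
            corpus.addAll(document);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }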
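I need to visit every document in the corpus and, for each document, read every word index. When the corpus grows to thousands of documents, how can I iterate over these nested lists quickly?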
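To make the question concrete, here is a minimal sketch of the kind of traversal I mean. Nested for-each loops visit each word index exactly once, so one pass is linear in the total number of indices. If the corpus is scanned many times, copying it once into an `int[][]` avoids the `Integer` unboxing on every read, which may help. The class name `CorpusIteration` and the methods `toPrimitive` and `countWordIds` are hypothetical names made up for this illustration; they are not part of my code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CorpusIteration {

    // Copy the nested lists into primitive arrays once, so that later
    // passes read plain ints instead of unboxing Integer objects.
    public static int[][] toPrimitive(List<List<Integer>> nested) {
        int[][] out = new int[nested.size()][];
        for (int i = 0; i < nested.size(); i++) {
            List<Integer> row = nested.get(i);
            int[] ids = new int[row.size()];
            for (int j = 0; j < ids.length; j++) {
                ids[j] = row.get(j);
            }
            out[i] = ids;
        }
        return out;
    }

    // Example pass over the data: count how often each word id occurs.
    // Assumes every id is in the range [0, vocabularySize).
    public static int[] countWordIds(int[][] corpus, int vocabularySize) {
        int[] counts = new int[vocabularySize];
        for (int[] sentence : corpus) {
            for (int wordId : sentence) {
                counts[wordId]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up corpus: two sentences over a vocabulary of three ids.
        List<List<Integer>> corpus = new ArrayList<>();
        corpus.add(Arrays.asList(0, 1, 2));
        corpus.add(Arrays.asList(1, 2, 2));

        int[][] primitive = toPrimitive(corpus);
        int[] counts = countWordIds(primitive, 3);
        System.out.println(Arrays.toString(counts)); // prints [1, 2, 3]
    }
}
```

The conversion costs one extra pass and extra memory, so it only seems worthwhile when the same corpus is traversed repeatedly, for example once per iteration of a training loop.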