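I want to build an inverted index in Java. My data consists of 1400 text files. I can compute the frequency of each term/word, and I can already return the number of times a word occurs across the whole collection, but I cannot create the triples (t, d, f), where t = term, d = doc, and f = frequency.
I would like the output in the following form:
term1: doc1:2,
term2: doc2:3,
term1: doc3:1
Here a term is a word in a document file, and doc1:2 means that term1 occurs 2 times in doc 1.
This is my code so far: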
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// The class declaration was not part of the original post; "Main" is a placeholder name.
public class Main {
    public static void main(String[] args) throws FileNotFoundException {
        // Term -> frequency within the document currently being read
        Map<String, Integer> m = new HashMap<>();
        String wrd;
        for (int i = 1; i <= 2; i++) {
            Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"));
            while (tdsc.hasNext()) {
                wrd = tdsc.next();
                Integer freq = m.get(wrd);
                m.put(wrd, (freq == null) ? 1 : freq + 1);
            }
            System.out.println(m.size() + " distinct words" + " steem" + i);
            System.out.println("Doc" + i + "" + m);
            // The counts are discarded here, so nothing links a term to its document afterwards
            m.clear();
            tdsc.close();
        }
    }
}
Answer 0 (score: 0)
How about storing it as a List<Map<String, Integer>>?
Create a new Map for each document that maps each term to its frequency in that document.
// Requires java.io.File and java.util.{ArrayList, HashMap, List, Map, Scanner}
List<Map<String, Integer>> list = new ArrayList<>();
// Iterate over documents
for (int i = 1; i <= 2; i++) {
    // Term -> frequency for document i only
    Map<String, Integer> map = new HashMap<>();
    // try-with-resources closes the Scanner even if reading fails
    try (Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"))) {
        // Iterate over words
        while (tdsc.hasNext()) {
            String word = tdsc.next();
            Integer freq = map.get(word);
            map.put(word, (freq == null) ? 1 : freq + 1);
        }
    }
    list.add(map);
}
// Print result; numbering starts at 1 to match the file names steem1.txt, steem2.txt
int documentNumber = 1;
for (Map<String, Integer> document : list) {
    for (Map.Entry<String, Integer> entry : document.entrySet()) {
        System.out.println(entry.getKey() + ":doc" + documentNumber + ":" + entry.getValue());
    }
    documentNumber++;
}
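The list-of-maps approach groups the counts by document. If the goal is an index keyed by term, as the requested output suggests, one possible variation is a nested map from term to (document id, frequency). The following is only a sketch under a few assumptions: the class name InvertedIndexSketch and the methods buildIndex and toLines are made up for illustration, the documents are passed in as strings so the example runs without any files on disk, and List.of requires Java 9 or later. To read the real files, the Scanner could be built from new File(...) as in the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class InvertedIndexSketch {

    // Builds term -> (document id -> frequency of the term in that document).
    // TreeMap keeps terms and document ids sorted, which makes the output stable.
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> documents) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1; // number documents from 1, like steem1.txt, steem2.txt
            try (Scanner words = new Scanner(documents.get(i))) {
                while (words.hasNext()) {
                    String term = words.next();
                    // Create the posting map on first sight of the term, then bump the count
                    index.computeIfAbsent(term, t -> new TreeMap<>())
                         .merge(docId, 1, Integer::sum);
                }
            }
        }
        return index;
    }

    // Flattens the index into one "term: docN:f" line per (t, d, f) triple.
    static List<String> toLines(Map<String, Map<Integer, Integer>> index) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : index.entrySet()) {
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                lines.add(termEntry.getKey() + ": doc" + posting.getKey() + ":" + posting.getValue());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two tiny in-memory documents stand in for the text files
        List<String> documents = List.of("apple banana apple", "banana cherry");
        for (String line : toLines(buildIndex(documents))) {
            System.out.println(line);
        }
    }
}
```

With this layout, looking up a term returns every document that contains it together with the count, which is the usual access pattern for an inverted index. The list of per-document maps would instead have to be scanned document by document.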