How to create triples using a Map

Date: 2016-10-04 06:31:35

Tags: java information-retrieval

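I want to build an inverted index in Java. My data is 1400 text files. I can compute the frequency of each term/word, and I can already report how many times a word occurs across the whole collection, but I cannot work out how to create triples (t, d, f), where t = term, d = doc, and f = frequency. My code so far is included below.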

I would like the output in the following form:

term1: doc1:2, 
term2: doc2:3, 
term1: doc3:1 

Here a term is a word in a document file, and doc1:2 means that term1 occurs 2 times in doc1.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// The post omitted the class declaration; the name WordCount is a placeholder.
public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // term -> frequency within the document currently being read
        Map<String, Integer> m = new HashMap<>();
        String wrd;

        for (int i = 1; i <= 2; i++) {
            Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"));
            while (tdsc.hasNext()) {
                wrd = tdsc.next();
                Integer freq = m.get(wrd);
                m.put(wrd, (freq == null) ? 1 : freq + 1);
            }

            System.out.println(m.size() + " distinct words" + " steem" + i);
            System.out.println("Doc" + i + "" + m);
            // The counts are discarded here, so no (term, doc, frequency) triple survives the loop.
            m.clear();

            tdsc.close();
        }
    }
}

1 Answer:

Answer 0: (score: 0)

How about storing it as a List<Map<String, Integer>>? Create a new Map for each document that maps each term to its frequency in that document.

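    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Scanner;

    // Class name is a placeholder; the answer only gave the method body.
    public class TermFrequencies {
        public static void main(String[] args) throws FileNotFoundException {
            // One term -> frequency map per document, in document order
            List<Map<String, Integer>> list = new ArrayList<>();

            // Iterate over documents
            for (int i = 1; i <= 2; i++) {
                Map<String, Integer> map = new HashMap<>();
                // try-with-resources closes the Scanner even if reading fails
                try (Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"))) {
                    // Iterate over words
                    while (tdsc.hasNext()) {
                        String word = tdsc.next();
                        Integer freq = map.get(word);
                        map.put(word, (freq == null) ? 1 : freq + 1);
                    }
                }
                list.add(map);
            }

            // Print result; numbering starts at 1 to match the file names
            int documentNumber = 1;
            for (Map<String, Integer> document : list) {
                for (Map.Entry<String, Integer> entry : document.entrySet()) {
                    System.out.println(entry.getKey() + ":doc" + documentNumber + ":" + entry.getValue());
                }
                documentNumber++;
            }
        }
    }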
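
The approach above keeps one map per document, so its output is grouped by document. If the output should instead be grouped by term, as in the question's example, one option is a nested map from term to (document id to frequency), which is closer to what is usually meant by an inverted index. The following is only a minimal sketch under that assumption: the class name `InvertedIndexSketch`, its methods `addDocument`, `frequency`, and `format`, and the in-memory sample texts are illustrative and do not come from the original post.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // term -> (document id -> frequency); TreeMap keeps terms and doc ids sorted for printing
    private final Map<String, Map<Integer, Integer>> index = new TreeMap<>();

    // Counts every whitespace-separated token from the scanner under the given document id.
    public void addDocument(int docId, Scanner scanner) {
        while (scanner.hasNext()) {
            String term = scanner.next();
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Returns how often the term occurs in the document, or 0 if it never does.
    public int frequency(String term, int docId) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null) {
            return 0;
        }
        return postings.getOrDefault(docId, 0);
    }

    // Renders one line per term, e.g. "apple: doc1:2, doc3:1".
    public String format() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : index.entrySet()) {
            sb.append(termEntry.getKey()).append(": ");
            String separator = "";
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                sb.append(separator)
                  .append("doc").append(posting.getKey())
                  .append(':').append(posting.getValue());
                separator = ", ";
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        // In-memory stand-ins for the text files (made-up sample data);
        // a Scanner built from new File(path) could be passed in the same way.
        idx.addDocument(1, new Scanner("apple banana apple"));
        idx.addDocument(2, new Scanner("banana cherry"));
        System.out.print(idx.format());
    }
}
```

Each (term, document id, frequency) triple is one entry of an inner map, so nothing needs to be cleared between documents. Passing a `Scanner` into `addDocument` keeps the counting logic independent of where the text comes from.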