Java HashMap to matrix conversion

Time: 2018-02-13 20:13:33

Tags: java matrix hashmap sparse-matrix

I initialized a HashMap and a nested HashMap to store each term, its occurrences, and their frequencies.

// Tokens after index 0 come in (fileName, frequency) pairs.
for (int i = 1; i + 1 < lineTokens.length; i += 2)
{
    String fileName = lineTokens[i];
    int frequency = Integer.parseInt(lineTokens[i + 1]);
    postingList2.put(fileName, frequency);
}
// postingList2 is assumed to be a fresh map for each topic.
postingList.put(topic, postingList2);

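This gives me the output:

{cancel={WET4793.txt=16, WET5590.txt=53}, unavailable={WET4291.txt=10}, station info={WET2266.txt=32}, advocacy program={WET2776.txt=32}, no ratingslogin={WET5376.txt=76},

I am trying to represent the whole thing as a matrix, but I cannot set 0 for the files that do not contain a particular term. It should look like this: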

row-> term
column -> document
mat[row][column] = frequency of occurrences of the term in the document.

I did this easily in Python using a pandas DataFrame.

1 Answer:

Answer 0 (score: 1):

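Given your initial HashMap, converting it to a matrix takes three steps:

  1. Create a unique index ID for each topic (0, 1, ...)
  2. Create a unique index ID for each document (0, 1, ...)
  3. Fill the matrix using the indices above

This solution uses Map lookups (keyed by topic and by document) for efficiency. The order of the topics and documents can be controlled; no attempt is made here to impose a specific order.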
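If a reproducible row and column order matters, one option is to sort the keys before assigning indices. The following is a minimal sketch and not part of the original answer: the class `SortedIndexSketch` and the helper `sortedIndex` are hypothetical names, and alphabetical order is only one possible choice.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class, introduced only for illustration.
class SortedIndexSketch {

    /** Assigns 0, 1, 2, ... to the distinct keys in their natural (alphabetical) order. */
    static Map<String, Integer> sortedIndex(Iterable<String> keys) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String key : keys) {
            sorted.add(key); // TreeSet drops duplicates and keeps the keys sorted
        }
        // LinkedHashMap preserves insertion order, so the indices print in ascending order.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String key : sorted) {
            index.put(key, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> topicIndex = sortedIndex(
                Arrays.asList("unavailable", "cancel", "station info"));
        System.out.println(topicIndex); // {cancel=0, station info=1, unavailable=2}
    }
}
```

Passing `postingList.keySet()` to such a helper would give the topic index; the document index could be built the same way once all document names are collected into one collection.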

Step 1: Create unique IDs for the topics and build a lookup map

    Map<String, Integer> topicIndex = new HashMap<>();
    List<String> topicList = new ArrayList<>();  // topicList is used to print the matrix
    int index = 0;
    for (String topic : postingList.keySet()) {
        if (!topicIndex.containsKey(topic)) {
            topicIndex.put(topic, index++);
            topicList.add(topic);
        }
    }
    

The resulting map (every term now has a unique ID):

    Topics: {cancel=0, unavailable=1, station info=2, advocacy program=3, no ratingslogin=4}
    

Step 2: Create unique IDs for the documents and build a lookup map

    index = 0;
    Map<String, Integer> documentIndex = new HashMap<>();
    for (String topic : postingList.keySet()) {
        for (String document : postingList.get(topic).keySet()) {
            if (!documentIndex.containsKey(document))
                documentIndex.put(document, index++);
        }
    }
    

The resulting map (every document now has a unique ID):

    Documents: {WET4793.txt=0, WET4291.txt=2, WET2266.txt=3, WET2776.txt=4, WET5376.txt=5, WET5590.txt=1}
    

Step 3: Create and fill the matrix

    int[][] mat = new int[topicIndex.size()][documentIndex.size()];
    for (String topic : postingList.keySet()) {
        for (String document : postingList.get(topic).keySet()) {
            mat[topicIndex.get(topic)][documentIndex.get(document)] = postingList.get(topic).get(document);
        }
    }
    

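Result: the matrix now looks like this. Cells that are never assigned stay at 0, because Java initializes every element of a new int array to 0: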

    cancel          16 53  0  0  0  0 
    unavailable      0  0 10  0  0  0 
    station info     0  0  0 32  0  0 
    advocacy program 0  0  0  0 32  0 
    no ratingslogin  0  0  0  0  0 76 
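
Once the indices exist, a single frequency can be read back by name. This is a small sketch that assumes structures shaped like `mat`, `topicIndex`, and `documentIndex` from the steps above; the class `MatrixLookupSketch` and the method `frequency` are hypothetical names, and returning 0 for an unknown topic or document is an assumption about the desired behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, introduced only for illustration.
class MatrixLookupSketch {

    /** Returns the stored frequency, or 0 if the topic or document has no index (assumption). */
    static int frequency(int[][] mat,
                         Map<String, Integer> topicIndex,
                         Map<String, Integer> documentIndex,
                         String topic, String document) {
        Integer row = topicIndex.get(topic);
        Integer col = documentIndex.get(document);
        if (row == null || col == null) {
            return 0; // unknown name: treated as "does not occur"
        }
        return mat[row][col];
    }

    public static void main(String[] args) {
        // A two-topic excerpt of the data from the question.
        Map<String, Integer> topicIndex = new HashMap<>();
        topicIndex.put("cancel", 0);
        topicIndex.put("unavailable", 1);
        Map<String, Integer> documentIndex = new HashMap<>();
        documentIndex.put("WET4793.txt", 0);
        documentIndex.put("WET5590.txt", 1);
        documentIndex.put("WET4291.txt", 2);
        int[][] mat = {
            {16, 53, 0},
            {0, 0, 10}
        };
        System.out.println(frequency(mat, topicIndex, documentIndex, "cancel", "WET5590.txt"));      // 53
        System.out.println(frequency(mat, topicIndex, documentIndex, "unavailable", "WET4793.txt")); // 0
    }
}
```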
    

Edit: a loop to print the matrix

    for (int row = 0; row < topicIndex.size(); row++) {
        System.out.printf("%-16s", topicList.get(row));
        for (int col = 0; col < documentIndex.size(); col++) {
            System.out.printf("%2d ", mat[row][col]);
        }
        System.out.println();
    }
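
The loop above labels the rows but not the columns. The sketch below bundles the three steps into one runnable class and adds a header row of document names. It is an illustration under assumptions, not the original answer's code: the class `TermDocumentMatrixSketch` and its methods are hypothetical names, the nested map is assumed to have the type `Map<String, Map<String, Integer>>`, and insertion order (via `LinkedHashMap`) is used so that rows and columns come out in a predictable order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class that bundles the three steps; all names are illustrative only.
class TermDocumentMatrixSketch {

    /** Distinct document names in first-seen order; the list position is the column index. */
    static List<String> distinctDocuments(Map<String, Map<String, Integer>> postingList) {
        Set<String> seen = new LinkedHashSet<>();
        for (Map<String, Integer> postings : postingList.values()) {
            seen.addAll(postings.keySet());
        }
        return new ArrayList<>(seen);
    }

    /** Builds the matrix; rows follow topics, columns follow documents. */
    static int[][] toMatrix(Map<String, Map<String, Integer>> postingList,
                            List<String> topics, List<String> documents) {
        Map<String, Integer> documentIndex = new HashMap<>();
        for (int col = 0; col < documents.size(); col++) {
            documentIndex.put(documents.get(col), col);
        }
        int[][] mat = new int[topics.size()][documents.size()]; // zero-filled by default
        for (int row = 0; row < topics.size(); row++) {
            // Assumes every topic is a key of postingList and every document is listed.
            for (Map.Entry<String, Integer> posting : postingList.get(topics.get(row)).entrySet()) {
                mat[row][documentIndex.get(posting.getKey())] = posting.getValue();
            }
        }
        return mat;
    }

    /** Renders the matrix as text with a header row of document names. */
    static String format(List<String> topics, List<String> documents, int[][] mat) {
        int width = 1;
        for (String document : documents) {
            width = Math.max(width, document.length());
        }
        String textCell = " %" + width + "s";
        String numberCell = " %" + width + "d";
        StringBuilder out = new StringBuilder(String.format("%-16s", ""));
        for (String document : documents) {
            out.append(String.format(textCell, document));
        }
        out.append(System.lineSeparator());
        for (int row = 0; row < topics.size(); row++) {
            out.append(String.format("%-16s", topics.get(row)));
            for (int col = 0; col < documents.size(); col++) {
                out.append(String.format(numberCell, mat[row][col]));
            }
            out.append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two topics taken from the question; LinkedHashMap keeps insertion order.
        Map<String, Map<String, Integer>> postingList = new LinkedHashMap<>();
        Map<String, Integer> cancel = new LinkedHashMap<>();
        cancel.put("WET4793.txt", 16);
        cancel.put("WET5590.txt", 53);
        postingList.put("cancel", cancel);
        Map<String, Integer> unavailable = new LinkedHashMap<>();
        unavailable.put("WET4291.txt", 10);
        postingList.put("unavailable", unavailable);

        List<String> topics = new ArrayList<>(postingList.keySet());
        List<String> documents = distinctDocuments(postingList);
        System.out.print(format(topics, documents, toMatrix(postingList, topics, documents)));
    }
}
```

Keeping the topics and documents in lists alongside the matrix mirrors the `topicList` idea from Step 1: the list position doubles as the index, so the same lists can label both axes when printing.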