I initialized a HashMap and a nested HashMap to store each term, the documents it occurs in, and its frequency in each document.
for (i = 1; i < lineTokens.length; i += 2)
{
    if (i + 1 >= lineTokens.length) continue;
    String fileName = lineTokens[i];
    int frequency = Integer.parseInt(lineTokens[i + 1]);
    postingList2.put(fileName, frequency);
    //System.out.println(postingList2);
}
postingList.put(topic, postingList2);
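It gives me this output:
{cancel={WET4793.txt=16, WET5590.txt=53}, unavailable={WET4291.txt=10}, station info={WET2266.txt=32}, advocacy program={WET2776.txt=32}, no ratingslogin={WET5376.txt=76},
I am trying to represent the whole thing as a matrix, but I cannot get 0 filled in for the files that do not contain a particular term. It should look like this:
row -> term
column -> document
mat[row][column] = frequency of occurrences of the term in the document.
I did this easily in Python using a pandas DataFrame.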
Answer 0 (score: 1)
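Given your initial HashMap, converting it to a matrix takes three steps.
This solution uses Map lookups (keyed by topic and by document) for efficiency. The order of the topics and documents can be controlled; no attempt is made here to create a specific order.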
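If a reproducible order is wanted, one option is to sort the keys before assigning IDs. The following is only a minimal sketch and not part of the original answer; the class name `SortedIndex` and the helper `buildIndex` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortedIndex {
    // Hypothetical helper: assigns 0, 1, 2, ... to the keys in natural (alphabetical) order.
    static Map<String, Integer> buildIndex(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        // LinkedHashMap keeps insertion order, so iterating the index follows the sorted order.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String key : sorted) {
            // Duplicate keys are skipped, so IDs stay contiguous.
            index.putIfAbsent(key, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> topics = Arrays.asList("unavailable", "cancel", "station info");
        System.out.println(buildIndex(topics));
    }
}
```

The same helper could be applied to the document names; the rest of the approach stays unchanged because it only relies on the lookup maps.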
Step 1: Create a unique ID for each topic and build a lookup map
Map<String, Integer> topicIndex = new HashMap<>();
List<String> topicList = new ArrayList<>(); // topicList is used to print the matrix
int index = 0;
for (String topic : postingList.keySet()) {
    if (!topicIndex.containsKey(topic)) {
        topicIndex.put(topic, index++);
        topicList.add(topic);
    }
}
The resulting map is (every topic now has a unique ID):
Topics: {cancel=0, unavailable=1, station info=2, advocacy program=3, no ratingslogin=4}
Step 2: Create a unique ID for each document and build a lookup map
index = 0;
Map<String, Integer> documentIndex = new HashMap<>();
for (String topic : postingList.keySet()) {
    for (String document : postingList.get(topic).keySet()) {
        if (!documentIndex.containsKey(document))
            documentIndex.put(document, index++);
    }
}
The resulting map is (every document now has a unique ID):
Documents: {WET4793.txt=0, WET4291.txt=2, WET2266.txt=3, WET2776.txt=4, WET5376.txt=5, WET5590.txt=1}
Step 3: Create and populate the matrix
int[][] mat = new int[topicIndex.size()][documentIndex.size()];
for (String topic : postingList.keySet()) {
    for (String document : postingList.get(topic).keySet()) {
        mat[topicIndex.get(topic)][documentIndex.get(document)] = postingList.get(topic).get(document);
    }
}
Result: the matrix now looks like this:
cancel          16 53  0  0  0  0
unavailable      0  0 10  0  0  0
station info     0  0  0 32  0  0
advocacy program 0  0  0  0 32  0
no ratingslogin  0  0  0  0  0 76
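The zeros need no explicit code: Java initializes every element of a new int array to 0, so only the cells that have a posting are overwritten. For reference, the three steps can be combined into one self-contained sketch. The class name `TermDocumentMatrix` and the method `toMatrix` are hypothetical, and using LinkedHashMap instead of HashMap is an assumption made here so that the IDs follow insertion order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermDocumentMatrix {
    // Hypothetical helper. topicList and documentList must be empty on entry;
    // on return they hold the row and column labels in ID order.
    static int[][] toMatrix(Map<String, Map<String, Integer>> postingList,
                            List<String> topicList, List<String> documentList) {
        Map<String, Integer> topicIndex = new LinkedHashMap<>();
        Map<String, Integer> documentIndex = new LinkedHashMap<>();
        // Steps 1 and 2: assign a unique ID to every topic and every document.
        for (String topic : postingList.keySet()) {
            topicIndex.put(topic, topicList.size());
            topicList.add(topic);
            for (String document : postingList.get(topic).keySet()) {
                if (!documentIndex.containsKey(document)) {
                    documentIndex.put(document, documentList.size());
                    documentList.add(document);
                }
            }
        }
        // Step 3: new int arrays are zero-filled, so missing postings stay 0.
        int[][] mat = new int[topicList.size()][documentList.size()];
        for (Map.Entry<String, Map<String, Integer>> row : postingList.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                mat[topicIndex.get(row.getKey())][documentIndex.get(cell.getKey())] = cell.getValue();
            }
        }
        return mat;
    }

    public static void main(String[] args) {
        // Sample data taken from the output shown in the question (first two topics).
        Map<String, Map<String, Integer>> postingList = new LinkedHashMap<>();
        Map<String, Integer> cancel = new LinkedHashMap<>();
        cancel.put("WET4793.txt", 16);
        cancel.put("WET5590.txt", 53);
        Map<String, Integer> unavailable = new LinkedHashMap<>();
        unavailable.put("WET4291.txt", 10);
        postingList.put("cancel", cancel);
        postingList.put("unavailable", unavailable);

        List<String> topics = new ArrayList<>();
        List<String> documents = new ArrayList<>();
        int[][] mat = toMatrix(postingList, topics, documents);
        for (int row = 0; row < mat.length; row++) {
            System.out.println(topics.get(row) + " " + Arrays.toString(mat[row]));
        }
    }
}
```

Iterating over `entrySet()` in the last loop avoids the repeated `postingList.get(topic)` lookups; the behavior is otherwise the same as in the three steps.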
Edit: a loop to print the matrix
for (int row = 0; row < topicIndex.size(); row++) {
    System.out.printf("%-16s", topicList.get(row));
    for (int col = 0; col < documentIndex.size(); col++) {
        System.out.printf("%2d ", mat[row][col]);
    }
    System.out.println();
}
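As a variation on the printing loop, a header row with the document names can make the columns easier to read. This is a minimal sketch, assuming the document names are available in a list ordered by column ID (the code above builds such a list only for topics); `MatrixPrinter` and `format` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class MatrixPrinter {
    // Hypothetical helper: renders the matrix with a header row of document names.
    // Each column is as wide as its document name; the label column fits the longest topic.
    static String format(List<String> topics, List<String> documents, int[][] mat) {
        int labelWidth = 1;
        for (String topic : topics) {
            labelWidth = Math.max(labelWidth, topic.length());
        }
        StringBuilder sb = new StringBuilder();
        // Header: blank label cell followed by the document names.
        sb.append(String.format("%-" + labelWidth + "s", ""));
        for (String document : documents) {
            sb.append(String.format(" %" + Math.max(1, document.length()) + "s", document));
        }
        sb.append("\n");
        // One line per topic, frequencies right-aligned under their document.
        for (int row = 0; row < topics.size(); row++) {
            sb.append(String.format("%-" + labelWidth + "s", topics.get(row)));
            for (int col = 0; col < documents.size(); col++) {
                int width = Math.max(1, documents.get(col).length());
                sb.append(String.format(" %" + width + "d", mat[row][col]));
            }
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> topics = Arrays.asList("cancel", "unavailable");
        List<String> documents = Arrays.asList("WET4793.txt", "WET5590.txt", "WET4291.txt");
        int[][] mat = { {16, 53, 0}, {0, 0, 10} };
        System.out.print(format(topics, documents, mat));
    }
}
```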