Question

I have to read several files and index every word in them. The index must follow this format:

Required ==> word, {d1, tf1, d2, tf2, d4, tf4}, someOtherValue

Explanation:

         1) word = any word in the files

         2) d1, d2, d4, ... are file IDs

         3) tf1, tf2, tf4, ... are the number of times the word appears
            in d1, d2, d4 respectively

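I created a class "Token" which holds a word from the files as 'String token', the name of the file it belongs to as 'String fileId', and its frequency in that file as 'int count'.

I can easily check the words within a single file and update their counts; I use an ArrayList for that. But when the same word appears in another file, how do I append that fileId and its count to the index entry?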

Answer 1

I would create a

class RefCount {
    String fileId;
    int count;
    RefCount( String fileId ){
        this.fileId = fileId;
        count = 1;
    }
    void increment(){
        count++;
    }
    // more...
}

The class Token should then be

class Token {
    String word;
    List<RefCount> references;
    ...

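    public void countWord( String fileId ){
        int last = references.size() - 1;
        if( last >= 0 ){
            RefCount rc = references.get(last);
            // Still in the same file: bump the existing count.
            if( rc.fileId.equals(fileId) ){
                rc.increment();
                return;
            }
        }
        // First occurrence of this word in this file.
        references.add( new RefCount(fileId) );
    }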
    // more...
}

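This assumes that you add references file by file, so it is only necessary to check the last file ID to determine whether we are still in the same file.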

To hold the tokens, you should use a Map<String,Token> keyed by word rather than a list.
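A minimal sketch of how the map-based index might fit together, assuming files are processed one after another as described above. WordIndex, its add and get methods, and Token.format are hypothetical names introduced for illustration; the 'someOtherValue' field from the question is left out because the question does not say what it holds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One (fileId, count) pair for a word.
class RefCount {
    final String fileId;
    int count = 1;
    RefCount(String fileId) { this.fileId = fileId; }
}

// A word together with its per-file counts.
class Token {
    final String word;
    final List<RefCount> references = new ArrayList<>();
    Token(String word) { this.word = word; }

    // Assumes files are indexed one at a time, so only the last entry is checked.
    void countWord(String fileId) {
        int last = references.size() - 1;
        if (last >= 0 && references.get(last).fileId.equals(fileId)) {
            references.get(last).count++;
            return;
        }
        references.add(new RefCount(fileId));
    }

    // Hypothetical helper: renders "word, {d1, tf1, d2, tf2}".
    String format() {
        StringBuilder sb = new StringBuilder(word).append(", {");
        for (int i = 0; i < references.size(); i++) {
            RefCount rc = references.get(i);
            if (i > 0) sb.append(", ");
            sb.append(rc.fileId).append(", ").append(rc.count);
        }
        return sb.append("}").toString();
    }
}

// Hypothetical wrapper around the Map<String,Token> suggested above.
class WordIndex {
    // LinkedHashMap keeps words in first-seen order; a HashMap would also work.
    private final Map<String, Token> tokens = new LinkedHashMap<>();

    void add(String word, String fileId) {
        // Create the Token on first sight of the word, then count the occurrence.
        tokens.computeIfAbsent(word, Token::new).countWord(fileId);
    }

    Token get(String word) { return tokens.get(word); }
}
```

With the map, looking up the Token for a word no longer requires scanning a list; each occurrence is recorded with a single call such as index.add(word, fileId).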

Edit: To display the result, you iterate over the map or list of all tokens, and then over each token's list of RefCount objects:

for( Token token: tokenList ){
    System.out.print( token.getWord() + ":" );
    for( RefCount refCount: token.getReferences() ){
        System.out.print( " " + refCount.getFileId() +
                          "*" + refCount.getCount() );
    }
    System.out.println();
}

You may want to terminate a line after every n-th id/count pair.
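One possible way to break the line after every n-th pair is sketched below. PairFormatter, its format method, and the perLine parameter are hypothetical names; the file IDs and counts are passed as two parallel lists only to keep the sketch independent of the classes above.

```java
import java.util.List;

class PairFormatter {
    // Builds "word: id*count id*count ..." and starts a new line
    // after every `perLine` pairs.
    static String format(String word, List<String> fileIds,
                         List<Integer> counts, int perLine) {
        StringBuilder sb = new StringBuilder(word).append(":");
        for (int i = 0; i < fileIds.size(); i++) {
            // A break goes before pair i when i is a positive multiple of perLine.
            if (i > 0 && i % perLine == 0) {
                sb.append("\n");
            }
            sb.append(" ").append(fileIds.get(i))
              .append("*").append(counts.get(i));
        }
        return sb.toString();
    }
}
```

The same modulo check can be dropped into the inner loop of the printing code above, using a counter for the current position in the RefCount list.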

Tokenizing and indexing many files
