Question

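I'm trying to cluster some words based on their pairwise similarity. Part of my data is shown below (it's just a sample, "animal.txt", laid out like an adjacency matrix).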

         cat  dog  horse  ostrich
cat       5    4     3       2
dog       4    5     1       2
horse     3    1     5       4
ostrich   2    2     4       5

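A higher number means greater similarity between the two words. From data in this format, I want to produce clusters. (For example, if I ask for 2 clusters, the result would be (cat, dog), (horse, ostrich).)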

I tried to use CLUTO to make the clusters.

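First, I have to restructure the input file before clustering with CLUTO. So I used doc2mat (http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat.html), but I don't know how to use it properly to produce the CLUTO input files (such as the mat and label files). And once the CLUTO input files exist, how can I cluster the data above?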

Answer 1

Since your data is an adjacency matrix, the corresponding CLUTO input file is a so-called GraphFile, not a MatrixFile, so doc2mat won't help.
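Before converting, it may be worth confirming that the data really is a square, symmetric matrix, since a typo in a hand-edited file is easy to miss. The following is only a minimal sketch: the subroutine name check_similarity_matrix is made up for illustration, and the matrix is hard-coded from the question rather than read from animal.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CLUTO): returns a list of problem
# descriptions; an empty list means the matrix is square and symmetric.
# $m is a reference to an array of row array references.
sub check_similarity_matrix {
    my ($m) = @_;
    my $n = @$m;                            # number of rows
    my @problems;
    for my $i (0 .. $n - 1) {
        my $len = @{ $m->[$i] };
        push @problems, "row $i has $len values, expected $n"
            if $len != $n;
    }
    return @problems if @problems;          # not square: skip the symmetry check
    for my $i (0 .. $n - 1) {
        for my $j ($i + 1 .. $n - 1) {
            push @problems, "entries ($i,$j) and ($j,$i) differ"
                if $m->[$i][$j] != $m->[$j][$i];
        }
    }
    return @problems;
}

# The example matrix from the question (rows: cat, dog, horse, ostrich).
my @animal = (
    [5, 4, 3, 2],
    [4, 5, 1, 2],
    [3, 1, 5, 4],
    [2, 2, 4, 5],
);
my @problems = check_similarity_matrix(\@animal);
print @problems ? join("\n", @problems) . "\n"
                : "matrix is square and symmetric\n";
```

The check deliberately stops at the first kind of problem (ragged rows) so that the symmetry loop never indexes past the end of a short row.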

This program, txt2graph.pl, converts a file like your example "animal.txt" into a graph file and a row label file:

#!/usr/bin/perl
@F = split ' ', <>;             # begin reading txt file, read column headers
($GraphFile = $ARGV) =~ s/(\.txt)?$/.graph/;    # animal.txt -> animal.graph
$LabelFile = $GraphFile.".rlabel";
open LABEL, ">$LabelFile" or die "cannot write $LabelFile: $!";
open GRAPH, ">$GraphFile" or die "cannot write $GraphFile: $!";
print GRAPH $#F+1, "\n";        # output number of vertices=objects=columns=rows
while (<>)
{                               # process each object row
    next unless /\S/;           # skip blank lines
    @F = split ' ', $_, 2;      # split into name, numbers
    print LABEL shift @F, "\n"; # output name
    print GRAPH @F;             # output numbers
}
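
For the sample animal.txt, the script should leave two files behind, roughly as sketched here. This is worked out by reading the script, not captured from a run; each row of numbers is copied straight from the input, so the spacing between values follows the original file.

```
animal.graph:
4
5 4 3 2
4 5 1 2
3 1 5 4
2 2 4 5

animal.graph.rlabel:
cat
dog
horse
ostrich
```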

After CLUTO has finished clustering, this program, pclusters.pl, prints the result in your desired output format:

#!/usr/bin/perl
($LabelFile = $ARGV[0]) =~ s/(\.clustering\.\d+)?$/.rlabel/;    # derive label file name
open LABEL, $LabelFile or die "cannot read $LabelFile: $!";
chomp(@label = <LABEL>); close LABEL;           # read labels
while (<>)
{
    chomp;                                      # line $. holds the cluster number of object $.
    next if $_ < 0;                             # skip negative entries (objects not assigned to a cluster)
    push @{$cluster[$_]}, $label[$.-1];         # add label to its cluster, creating it if needed
}
foreach $cluster (@cluster)
{
    print "(", join(', ', @$cluster), ")\n";    # print a cluster's labels
}
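
pclusters.pl reads the clustering file that scluster writes: one line per object, in the same order as the label file, each holding that object's cluster number. As a purely hypothetical illustration (the numbering CLUTO actually assigns may differ), a two-cluster solution matching the grouping asked for in the question would look like this, together with the output the script is meant to print for it:

```
animal.graph.clustering.2 (hypothetical contents):
0
0
1
1

intended output:
(cat, dog)
(horse, ostrich)
```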

The whole procedure is:

> txt2graph.pl animal.txt
> scluster animal.graph 2
> pclusters.pl animal.graph.clustering.2
