Data structures confusing my perceptron implementation in Java

Date: 2015-02-16 14:54:26

Tags: java data-structures machine-learning perceptron

I'm trying to implement the perceptron algorithm in Java — just a single perceptron, not a full neural network. This is the classification problem I'm trying to solve.

What I need to do is create a bag-of-words feature vector for each document, where each document belongs to one of four categories: politics, science, sports, or atheism. This is the data.

This is what I'm trying to implement (quoting directly from the first answer to this question):

Example:

Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]

The dictionary is:

["I", "am", "awesome", "great"]

So the documents as vectors would look like:

Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]

This way you can do all sorts of fancy math and feed it to your perceptron.
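The quoted scheme can be sketched in a few lines of Java. This is a minimal illustration, not the asker's code; the class and method names (`BowExample`, `toVector`) are made up for the example:

```java
import java.util.Arrays;
import java.util.List;

public class BowExample {
    // Count occurrences of each dictionary word in a document.
    // The vector's ordering is fixed by the dictionary list.
    static int[] toVector(List<String> dictionary, List<String> document) {
        int[] vector = new int[dictionary.size()];
        for (String word : document) {
            int index = dictionary.indexOf(word);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("I", "am", "awesome", "great");
        List<String> doc1 = Arrays.asList("I", "am", "awesome");
        List<String> doc2 = Arrays.asList("I", "am", "great", "great");
        System.out.println(Arrays.toString(toVector(dictionary, doc1))); // [1, 1, 1, 0]
        System.out.println(Arrays.toString(toVector(dictionary, doc2))); // [1, 1, 0, 2]
    }
}
```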

I've been able to generate the global dictionary; now I need to make one for each document, but how can I keep track of them all? The folder structure is very simple, i.e. `/politics/` contains lots of articles, and for each one I need to make a feature vector against the global dictionary. I think the iterator I'm using is confusing me.

Here is the main class:

public class BagOfWords 
{
    static Set<String> global_dict = new HashSet<String>();

    static boolean global_dict_complete = false; 

    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException 
    {
        //each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports"};

        //cycle through all categories once to populate the global dict
        for(int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }   

        //after the global dict has been filled up
        //cycle through again to populate a set of
        //words for each document, compare it to the
        //global dict. 
        for(int cycle = 0; cycle <= 3; cycle++)
        {
            if(cycle == 3)
                global_dict_complete = true;

            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        //print the data struc              
        //for (String s : global_dict)
            //System.out.println( s );
    }
}

This iterates over the directory structure:

public class Iterateur 
{
    static void iterateDirectory(File file, 
                             Set<String> global_dict, 
                             boolean global_dict_complete) throws IOException 
    {
        for (File f : file.listFiles()) 
        {
            if (f.isDirectory()) 
            {
                // recurse into the subdirectory f (passing file here recurses
                // on the same directory forever)
                iterateDirectory(f, global_dict, global_dict_complete);
            } 
            else 
            {
                BufferedReader br = new BufferedReader(new FileReader( f ));

                // hand the whole reader off once; calling readLine() here as
                // well would silently drop the first line of every file
                if (global_dict_complete == false)
                {
                    Dictionary.populate_dict(file, f, null, br, global_dict);
                }
                else
                {
                    FeatureVecteur.generateFeatureVecteur(file, f, null, br, global_dict);
                }
                br.close();
            }
        }
    }
}

This fills the global dictionary:

public class Dictionary 
{

    public static void populate_dict(File file, 
                                 File f, 
                                 String line, 
                                 BufferedReader br, 
                                 Set<String> global_dict) throws IOException
    {

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");//those are your words

            String word;

            for (int i = 0; i < words.length; i++) 
            {
                word = words[i];
                if (!global_dict.contains(word))
                {
                    global_dict.add(word);
                }
            }   
        }
    }
}

This is an initial attempt at filling the document-specific dictionary:

public class FeatureVecteur 
{
    public static void generateFeatureVecteur(File file, 
                                          File f, 
                                          String line, 
                                          BufferedReader br, 
                                          Set<String> global_dict) throws IOException
    {
        Set<String> file_dict = new HashSet<String>();

        while ((line = br.readLine()) != null) 
        {

            String[] words = line.split(" ");//those are your words

            String word;

            for (int i = 0; i < words.length; i++) 
            {
                word = words[i];
                if (!file_dict.contains(word))
                {
                    file_dict.add(word);
                }
            }   
        }
    }
}

1 answer:

Answer 0 (score: 2):

If I understand your question, you are trying to count, for a given file, the number of instances of each word in the global dictionary. I would suggest creating an array of integers, where the index represents the index into the global dictionary and the value represents the number of occurrences of that word in the file.

Then, for each word in the global dictionary, count how many times it appears in the file. Be careful, though — a feature vector needs a consistent ordering of its elements, and a HashSet does not guarantee that. For example, in your example, "I" always needs to be the first element. To solve this, you may want to convert your set to an ArrayList or some other sequential list once the global dictionary is completely finished.

ArrayList<String> global_dict_list = new ArrayList<String>( global_dict );

The counting might look something like this:

int[] wordFrequency = new int[global_dict_list.size()];

for ( int g = 0; g < global_dict_list.size(); g++ )
{
    for ( int i = 0; i < words.length; i++ ) 
    {
         if ( words[i].equals( global_dict_list.get(g) ) ) 
         {
             // index by the dictionary position g, not the word position i,
             // so every document's vector lines up the same way
             wordFrequency[g]++;
         }
    }
}

Nest that code inside the while loop that reads line by line in your feature vector code. Hope it helps!
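Putting the answer's pieces together, here is one possible end-to-end sketch of the per-file counting. The class and method names (`FeatureVectorSketch`, `countFrequencies`) are illustrative, and it assumes the file's words have already been collected into a list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FeatureVectorSketch {
    // Build a frequency vector for one document against the global dictionary.
    // The index into wordFrequency comes from the dictionary, not from the
    // document, so every document's vector uses the same ordering.
    static int[] countFrequencies(List<String> globalDictList, List<String> fileWords) {
        int[] wordFrequency = new int[globalDictList.size()];
        for (int g = 0; g < globalDictList.size(); g++) {
            for (String word : fileWords) {
                if (word.equals(globalDictList.get(g))) {
                    wordFrequency[g]++;
                }
            }
        }
        return wordFrequency;
    }

    public static void main(String[] args) {
        Set<String> global_dict = new HashSet<String>(
                Arrays.asList("I", "am", "awesome", "great"));
        // Freeze the ordering once the global dictionary is complete.
        List<String> global_dict_list = new ArrayList<String>(global_dict);
        List<String> fileWords = Arrays.asList("I", "am", "great", "great");
        int[] vec = countFrequencies(global_dict_list, fileWords);
        System.out.println(Arrays.toString(vec));
    }
}
```

Note that `main` prints in whatever order the `HashSet` happened to produce, which is exactly why fixing the ordering in a list before building any vectors matters.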