I'm trying to implement the perceptron algorithm in Java — just the single perceptron, not a full neural-network type. This is the classification problem I'm trying to solve.
What I need to do is create a bag-of-words feature vector for each document, which belongs to one of four categories: politics, science, sports, or atheism. This is the data.
I'm trying to implement this (quoting directly from the first answer to this question):
Example:
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
The dictionary is:
["I", "am", "awesome", "great"]
So the documents, as vectors, look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
This way you can do all kinds of fancy math and feed it to your perceptron.
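As a sanity check, the toy example above can be reproduced with a short, self-contained sketch (the class and method names here are my own, not from the question):

```java
import java.util.Arrays;
import java.util.List;

public class BowDemo // hypothetical demo class, not part of the question's code
{
    // Count how often each vocabulary word occurs in a document.
    static int[] vectorize(List<String> doc, List<String> vocab)
    {
        int[] vec = new int[vocab.size()];
        for (String w : doc)
        {
            int idx = vocab.indexOf(w);
            if (idx >= 0)
                vec[idx]++;
        }
        return vec;
    }

    public static void main(String[] args)
    {
        List<String> vocab = Arrays.asList("I", "am", "awesome", "great");
        int[] d1 = vectorize(Arrays.asList("I", "am", "awesome"), vocab);
        int[] d2 = vectorize(Arrays.asList("I", "am", "great", "great"), vocab);
        System.out.println(Arrays.toString(d1)); // [1, 1, 1, 0]
        System.out.println(Arrays.toString(d2)); // [1, 1, 0, 2]
    }
}
```

Note that the vector's ordering comes entirely from the vocabulary list, which is why the vocabulary must have a fixed order.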
I've been able to generate the global dictionary; now I need to make one for each document, but how can I keep track of them all? The folder structure is straightforward, i.e. `/politics/` contains lots of articles, and for each one I need to make a feature vector against the global dictionary. I think the iterator I'm using has me confused.
Here is the main class:
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class BagOfWords
{
    static Set<String> global_dict = new HashSet<String>();
    static boolean global_dict_complete = false;
    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException
    {
        // each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports" };

        // cycle through all categories once to populate the global dict
        for (int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File(general_data_partition);
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        // after the global dict has been filled up, cycle through again
        // to populate a set of words for each document and compare it
        // to the global dict
        global_dict_complete = true; // set before the loop, not only on the last cycle
        for (int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File(general_data_partition);
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        // print the data structure
        //for (String s : global_dict)
        //    System.out.println(s);
    }
}
This iterates over the directory structure:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

public class Iterateur
{
    static void iterateDirectory(File file,
                                 Set<String> global_dict,
                                 boolean global_dict_complete) throws IOException
    {
        for (File f : file.listFiles())
        {
            if (f.isDirectory())
            {
                // recurse into the subdirectory f; passing "file" here
                // would recurse forever on the same directory
                iterateDirectory(f, global_dict, global_dict_complete);
            }
            else
            {
                String line;
                BufferedReader br = new BufferedReader(new FileReader(f));
                while ((line = br.readLine()) != null)
                {
                    if (!global_dict_complete)
                    {
                        Dictionary.populate_dict(file, f, line, br, global_dict);
                    }
                    else
                    {
                        FeatureVecteur.generateFeatureVecteur(file, f, line, br, global_dict);
                    }
                }
                br.close();
            }
        }
    }
}
This populates the global dictionary:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.Set;

public class Dictionary
{
    public static void populate_dict(File file,
                                     File f,
                                     String line,
                                     BufferedReader br,
                                     Set<String> global_dict) throws IOException
    {
        // process the line the caller already read before reading further,
        // otherwise the first line of every file is silently skipped
        do
        {
            String[] words = line.split(" "); // those are your words
            for (String word : words)
            {
                // a Set ignores duplicates, so no contains() check is needed
                global_dict.add(word);
            }
        } while ((line = br.readLine()) != null);
    }
}
This is an initial attempt at populating the document-specific dictionaries:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FeatureVecteur
{
    public static void generateFeatureVecteur(File file,
                                              File f,
                                              String line,
                                              BufferedReader br,
                                              Set<String> global_dict) throws IOException
    {
        Set<String> file_dict = new HashSet<String>();
        // process the line the caller already read before reading further,
        // otherwise the first line of every file is silently skipped
        do
        {
            String[] words = line.split(" "); // those are your words
            for (String word : words)
            {
                file_dict.add(word);
            }
        } while ((line = br.readLine()) != null);
    }
}
Answer (score: 2)
If I understand your question, you're trying to count, for a given file, the number of instances of each word in the global dictionary. I'd suggest creating an array of integers, where the index represents the index into the global dictionary and the value represents the number of occurrences of that word in the file.
Then, for each word in the global dictionary, count how many times it appears in the file. Be careful, though: feature vectors need a consistent ordering of their elements, and a HashSet doesn't guarantee one. In your example, "I" always needs to be the first element. To solve this, you may want to convert your set to an ArrayList or some other sequential list once the global dictionary is completely finished.
ArrayList<String> global_dict_list = new ArrayList<String>( global_dict );
The counting might look something like this:
int[] wordFrequency = new int[global_dict_list.size()];
for ( int j = 0; j < global_dict_list.size(); j++ )
{
    for ( int i = 0; i < words.length; i++ )
    {
        if ( words[i].equals( global_dict_list.get(j) ) )
        {
            // increment the count at the dictionary word's index,
            // not at the file word's index
            wordFrequency[j]++;
        }
    }
}
Nest that code inside the while loop that reads line by line in your feature-vector code. Hope it helps!
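To also answer the "how can I keep track of them all" part of the question: one option (my own suggestion, with made-up names, not part of the answer above) is to store one count array per document in a LinkedHashMap, which preserves insertion order so the vectors stay aligned with the files they came from:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorStore // hypothetical helper class for illustration
{
    // maps each document's name to its feature vector,
    // preserving insertion order
    static Map<String, int[]> vectors = new LinkedHashMap<String, int[]>();

    // count occurrences of each global-dictionary word in one document
    static int[] countWords(List<String> words, List<String> globalDictList)
    {
        int[] wordFrequency = new int[globalDictList.size()];
        for (int j = 0; j < globalDictList.size(); j++)
        {
            for (String w : words)
            {
                if (w.equals(globalDictList.get(j)))
                    wordFrequency[j]++;
            }
        }
        return wordFrequency;
    }

    public static void main(String[] args)
    {
        List<String> dict = Arrays.asList("I", "am", "awesome", "great");
        vectors.put("doc1.txt", countWords(Arrays.asList("I", "am", "awesome"), dict));
        vectors.put("doc2.txt", countWords(Arrays.asList("I", "am", "great", "great"), dict));
        System.out.println(Arrays.toString(vectors.get("doc2.txt"))); // [1, 1, 0, 2]
    }
}
```

In the real code, the map key would be something like `f.getPath()` from the directory iterator, so each article under `/politics/` etc. keeps its own vector.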