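I am trying to write a text classifier in Weka using Naive Bayes. My training data is a collection of Foursquare tips in an Excel file, with close to 500 labeled positive and roughly the same number labeled negative. The input file has two columns: the first is the tip text and the second is the labeled polarity. I am using AFINN-111.txt to add an extra attribute in the hope of improving the output. It looks up every polar word in the tip and sums their scores into a single final score. Here is my entire code: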
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.File;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// JExcelApi (jxl); inferred from the Workbook.getWorkbook(File) call below
import jxl.Sheet;
import jxl.Workbook;

public class DataReader {

    // AFINN word -> score, filled once from AFINN-111.txt
    static Map<String, Integer> affinMap = new HashMap<String, Integer>();

    // Returns one [tip, affinScore, polarity] row per tip in the spreadsheet
    public List<List<Object>> createAttributeList() {
        ClassLoader classLoader = getClass().getClassLoader();
        initializeAFFINMap(classLoader);
        File inputWorkbook = new File(classLoader
                .getResource("Tip_dataset2.xls").getFile());
        Workbook w;
        Sheet sheet = null;
        try {
            w = Workbook.getWorkbook(inputWorkbook);
            // Get the first sheet
            sheet = w.getSheet(0);
        } catch (Exception e) {
            e.printStackTrace();
        }
        List<List<Object>> attributeList = new ArrayList<List<Object>>();
        // Row 0 is the header, so start at row 1
        for (int i = 1; i < sheet.getRows(); i++) {
            String tip = sheet.getCell(0, i).getContents();
            // Strip quotes and replace symbols before tokenizing
            tip = tip.replaceAll("'", "");
            tip = tip.replaceAll("\"", "");
            tip = tip.replaceAll("%", " percent");
            tip = tip.replaceAll("@", " ATAUTHOR");
            String polarity = getPolarity(sheet.getCell(1, i).getContents());
            // Sum the AFINN scores of all words found in the lexicon
            int affinScore = 0;
            String[] arr = tip.split(" ");
            for (int j = 0; j < arr.length; j++) {
                if (affinMap.containsKey(arr[j].toLowerCase())) {
                    affinScore = affinScore
                            + affinMap.get(arr[j].toLowerCase());
                }
            }
            List<Object> attrs = new ArrayList<Object>();
            attrs.add(tip);
            attrs.add(affinScore);
            attrs.add(polarity);
            attributeList.add(attrs);
        }
        return attributeList;
    }

    // Anything other than "positive" is treated as "negative"
    private String getPolarity(String cell) {
        if (cell.equalsIgnoreCase("positive")) {
            return "positive";
        } else {
            return "negative";
        }
    }

    // Loads the tab-separated "word<TAB>score" lines of AFINN-111.txt
    private void initializeAFFINMap(ClassLoader classLoader) {
        try {
            InputStream stream = classLoader
                    .getResourceAsStream("AFINN-111.txt");
            DataInputStream in = new DataInputStream(stream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String str;
            while ((str = br.readLine()) != null) {
                String[] array = str.split("\t");
                affinMap.put(array[0], Integer.parseInt(array[1]));
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Object>> attrList = new DataReader().createAttributeList();
        new CreateTrainedModel().createTrainingData(attrList);
    }
}
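The AFINN scoring step inside createAttributeList can be looked at in isolation. Below is a minimal sketch of that step in plain Java, with no Weka or spreadsheet dependency. The class name `AfinnScorer`, the methods `parseLexicon` and `score`, and the toy lexicon entries are assumptions made for illustration; it also splits on any whitespace rather than a single space, which differs slightly from the code above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sums AFINN-style word scores over a tip. A sketch, not the code from the post. */
final class AfinnScorer {

    private AfinnScorer() {}

    /** Parses "word<TAB>score" lines, the layout that AFINN-111.txt uses. */
    static Map<String, Integer> parseLexicon(Iterable<String> lines) {
        Map<String, Integer> lexicon = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) { // skip malformed lines instead of failing
                lexicon.put(parts[0], Integer.parseInt(parts[1].trim()));
            }
        }
        return lexicon;
    }

    /** Lower-cases the tip, splits on whitespace, and adds up the scores of known words. */
    static int score(String tip, Map<String, Integer> lexicon) {
        int total = 0;
        for (String token : tip.toLowerCase(Locale.ROOT).split("\\s+")) {
            total += lexicon.getOrDefault(token, 0); // unknown words contribute 0
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy entries for illustration only; real values come from AFINN-111.txt.
        Map<String, Integer> lexicon = parseLexicon(
                Arrays.asList("good\t3", "bad\t-3", "awesome\t4"));
        System.out.println(score("Burger here is good", lexicon));
    }
}
```

With the toy lexicon, only "good" matches in the sample tip, so the summed score is just that word's score.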
Here is the actual classifier class:
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.List;

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CreateTrainedModel {

    public void createTrainingData(List<List<Object>> attrList)
            throws Exception {
        // Attributes: tip (string), affinScore (numeric), polarity (nominal class)
        Attribute tip = new Attribute("tip", (FastVector) null);
        Attribute affin = new Attribute("affinScore");
        FastVector pol = new FastVector(2);
        pol.addElement("positive");
        pol.addElement("negative");
        Attribute polaritycl = new Attribute("polarity", pol);
        FastVector inputDataDesc = new FastVector(3);
        inputDataDesc.addElement(tip);
        inputDataDesc.addElement(affin);
        inputDataDesc.addElement(polaritycl);
        Instances dataSet = new Instances("dataset", inputDataDesc,
                attrList.size());
        // Set class index
        dataSet.setClassIndex(2);
        // One training instance per [tip, affinScore, polarity] row
        for (List<Object> onList : attrList) {
            Instance in = new Instance(3);
            in.setValue((Attribute) inputDataDesc.elementAt(0), onList.get(0)
                    .toString());
            in.setValue((Attribute) inputDataDesc.elementAt(1),
                    Integer.parseInt(onList.get(1).toString()));
            in.setValue((Attribute) inputDataDesc.elementAt(2), onList.get(2)
                    .toString());
            dataSet.add(in);
        }
        // Turn the tip text into word attributes
        Filter f = new StringToWordVector();
        f.setInputFormat(dataSet);
        dataSet = Filter.useFilter(dataSet, f);
        Classifier model = (Classifier) new NaiveBayes();
        try {
            model.buildClassifier(dataSet);
        } catch (Exception e1) { // TODO Auto-generated catch block
            e1.printStackTrace();
        }
        // Serialize the trained model
        ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(
                "FS-TipsNaiveBayes.model"));
        oos.writeObject(model);
        oos.flush();
        oos.close();
        // Build a single test instance and ask the model for its class distribution
        FastVector fvWekaAttributes1 = new FastVector(3);
        fvWekaAttributes1.addElement(tip);
        fvWekaAttributes1.addElement(affin);
        Instance in = new Instance(3);
        in.setValue((Attribute) fvWekaAttributes1.elementAt(0),
                "burger here is good");
        in.setValue((Attribute) fvWekaAttributes1.elementAt(1), 0);
        Instances testSet = new Instances("dataset", fvWekaAttributes1, 1);
        in.setDataset(testSet);
        double[] fDistribution = model.distributionForInstance(in);
        // Arrays.toString prints the values rather than the array reference
        System.out.println(Arrays.toString(fDistribution));
    }
}
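The problem I am facing is that for any input, the output distribution always comes out close to [0.52314376998377, 0.47685623001622995]. It always leans toward positive rather than negative, and the numbers barely change from one input to the next. Any idea what I am doing wrong?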
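A note on the symptom: a distribution that barely moves and sits close to the class proportions of the training data is what naive Bayes yields when the features of the test instance carry no usable evidence, because the posterior is the class prior multiplied by the per-feature likelihoods and then normalized. The sketch below shows that effect in plain Java with made-up numbers. It is not Weka's NaiveBayes implementation; `PriorFallbackDemo` and `posterior` are hypothetical names, and whether this is what actually happens with the test instance in this post is an assumption, not something verified here.

```java
import java.util.Arrays;

/** Toy two-class naive Bayes posterior with illustrative numbers. */
final class PriorFallbackDemo {

    private PriorFallbackDemo() {}

    /**
     * Multiplies each class prior by that class's feature likelihoods, then normalizes.
     * likelihoods[c][f] stands for P(feature f | class c); an empty row means no evidence.
     */
    static double[] posterior(double[] priors, double[][] likelihoods) {
        double[] result = new double[priors.length];
        double sum = 0.0;
        for (int c = 0; c < priors.length; c++) {
            double p = priors[c];
            for (double likelihood : likelihoods[c]) {
                p *= likelihood;
            }
            result[c] = p;
            sum += p;
        }
        for (int c = 0; c < result.length; c++) {
            result[c] /= sum; // normalize so the entries sum to 1
        }
        return result;
    }

    public static void main(String[] args) {
        double[] priors = {0.52, 0.48}; // made-up class proportions

        // No evidence at all: the posterior is just the prior.
        System.out.println(Arrays.toString(
                posterior(priors, new double[][] {{}, {}})));

        // Evidence that is equally likely under both classes: still the prior.
        System.out.println(Arrays.toString(
                posterior(priors, new double[][] {{0.3, 0.3}, {0.3, 0.3}})));

        // Evidence that favors the first class: the posterior moves away from the prior.
        System.out.println(Arrays.toString(
                posterior(priors, new double[][] {{0.9}, {0.1}})));
    }
}
```

If this is the cause, one thing worth checking is whether the test instance shares the header of the filtered training data, that is, whether it is passed through the same StringToWordVector filter before distributionForInstance is called.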
Answer 0 (score: 0)
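I did not read your code, but one thing I can say is that AFINN scores are normalized to a fixed range. If your output leans toward the positive end of that range, you need to change your classification cost function, because it is overfitting your data.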
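On the point about range: each AFINN-111 entry is an integer score between -5 and +5, but the affinScore attribute in the question is a sum over all matched words, so the sum itself is not bounded. Below is a minimal sketch of one way to bring it into a fixed range, assuming the number of matched words is tracked alongside the sum (the question's code does not do this). `AfinnNormalizer`, `normalize`, and `matchedWords` are hypothetical names.

```java
/** Scales a summed AFINN score into [-1, 1]. A sketch under the assumptions stated above. */
final class AfinnNormalizer {

    /** AFINN-111 word scores run from -5 to +5. */
    static final int MAX_ABS_WORD_SCORE = 5;

    private AfinnNormalizer() {}

    /**
     * Divides the summed score by the number of matched words to get a mean
     * in [-5, 5], then divides by 5 to land in [-1, 1].
     * Returns 0.0 when no lexicon word matched.
     */
    static double normalize(int summedScore, int matchedWords) {
        if (matchedWords <= 0) {
            return 0.0;
        }
        double mean = (double) summedScore / matchedWords;
        return mean / MAX_ABS_WORD_SCORE;
    }

    public static void main(String[] args) {
        System.out.println(normalize(3, 1));   // one mildly positive word
        System.out.println(normalize(-10, 2)); // two maximally negative words
        System.out.println(normalize(0, 0));   // nothing matched
    }
}
```

Whether normalizing this attribute changes the classifier's output in this case is not something the sketch establishes.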