Weka - Naive Bayes always gives borderline results

Time: 2015-03-28 18:55:00

Tags: nlp weka text-classification

I am trying to write a text classifier in Weka using Naive Bayes. My training data is a collection of Foursquare tips in an Excel file: close to 500 are labeled positive and roughly the same number negative. The input file has two columns, the first holding the tip text and the second the labeled polarity. To improve the output I add an extra attribute derived from AFINN-111.txt: it looks up every polarity word in the tip and sums their scores into a single value. Here is my entire code:

    public class DataReader {

    static Map<String, Integer> affinMap = new HashMap<String, Integer>();

    public List<List<Object>> createAttributeList() {
        ClassLoader classLoader = getClass().getClassLoader();
        initializeAFFINMap(classLoader);
        File inputWorkbook = new File(classLoader
                .getResource("Tip_dataset2.xls").getFile());
        Workbook w;
        Sheet sheet = null;
        try {
            w = Workbook.getWorkbook(inputWorkbook);
            // Get the first sheet
            sheet = w.getSheet(0);
        } catch (Exception e) {
            e.printStackTrace();
        }
        List<List<Object>> attributeList = new ArrayList<List<Object>>();
        for (int i = 1; i < sheet.getRows(); i++) {
            String tip = sheet.getCell(0, i).getContents();

            tip = tip.replaceAll("'", "");
            tip = tip.replaceAll("\"", "");
            tip = tip.replaceAll("%", " percent");
            tip = tip.replaceAll("@", " ATAUTHOR");
            String polarity = getPolarity(sheet.getCell(1, i).getContents());
            int affinScore = 0;
            String[] arr = tip.split(" ");
            for (int j = 0; j < arr.length; j++) {
                if (affinMap.containsKey(arr[j].toLowerCase())) {
                    affinScore = affinScore
                            + affinMap.get(arr[j].toLowerCase());
                }
            }
            List<Object> attrs = new ArrayList<Object>();
            attrs.add(tip);
            attrs.add(affinScore);
            attrs.add(polarity);

            attributeList.add(attrs);
        }
        return attributeList;
    }

    private String getPolarity(String cell) {
        if (cell.equalsIgnoreCase("positive")) {
            return "positive";
        } else {
            return "negative";
        }
    }

    private void initializeAFFINMap(ClassLoader classLoader) {
        try {
            InputStream stream = classLoader
                    .getResourceAsStream("AFINN-111.txt");
            DataInputStream in = new DataInputStream(stream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String str;
            while ((str = br.readLine()) != null) {
                String[] array = str.split("\t");
                affinMap.put(array[0], Integer.parseInt(array[1]));
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Object>> attrList=new DataReader().createAttributeList();
        new CreateTrainedModel().createTrainingData(attrList);
    }

}
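The AFINN scoring loop above is essentially a per-token dictionary lookup. A minimal, self-contained sketch of the same idea (using a small hypothetical in-memory lexicon rather than the real AFINN-111.txt file, whose entries are tab-separated word/score pairs with integer scores from -5 to +5) is:

```java
import java.util.HashMap;
import java.util.Map;

public class AfinnScoreSketch {

    // Hypothetical mini-lexicon standing in for AFINN-111, which maps
    // words to integer valence scores between -5 and +5.
    static Map<String, Integer> lexicon = new HashMap<>();
    static {
        lexicon.put("good", 3);
        lexicon.put("bad", -3);
        lexicon.put("awesome", 4);
    }

    // Sum the scores of every known word in the tip, ignoring case;
    // words not in the lexicon contribute nothing.
    static int score(String tip) {
        int total = 0;
        for (String token : tip.split(" ")) {
            Integer v = lexicon.get(token.toLowerCase());
            if (v != null) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("burger here is good"));   // 3
        System.out.println(score("Good but bad service"));  // 0
    }
}
```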

And here is the actual classifier class:

    public class CreateTrainedModel {

        public void createTrainingData(List<List<Object>> attrList)
                throws Exception {

            Attribute tip = new Attribute("tip", (FastVector) null);
            Attribute affin = new Attribute("affinScore");

            FastVector pol = new FastVector(2);
            pol.addElement("positive");
            pol.addElement("negative");
            Attribute polaritycl = new Attribute("polarity", pol);

            FastVector inputDataDesc = new FastVector(3);
            inputDataDesc.addElement(tip);
            inputDataDesc.addElement(affin);
            inputDataDesc.addElement(polaritycl);

            Instances dataSet = new Instances("dataset", inputDataDesc,
                    attrList.size());
            // Set class index
            dataSet.setClassIndex(2);

            for (List<Object> onList : attrList) {
                Instance in = new Instance(3);
                in.setValue((Attribute) inputDataDesc.elementAt(0), onList.get(0)
                        .toString());
                in.setValue((Attribute) inputDataDesc.elementAt(1),
                        Integer.parseInt(onList.get(1).toString()));
                in.setValue((Attribute) inputDataDesc.elementAt(2), onList.get(2)
                        .toString());

                dataSet.add(in);
            }

            Filter f = new StringToWordVector();
            f.setInputFormat(dataSet);
            dataSet = Filter.useFilter(dataSet, f);

            Classifier model = (Classifier) new NaiveBayes();
            try {
                model.buildClassifier(dataSet);
            } catch (Exception e1) {
                e1.printStackTrace();
            }

            ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(
                    "FS-TipsNaiveBayes.model"));
            oos.writeObject(model);
            oos.flush();
            oos.close();

            FastVector fvWekaAttributes1 = new FastVector(3);
            fvWekaAttributes1.addElement(tip);
            fvWekaAttributes1.addElement(affin);

            Instance in = new Instance(3);
            in.setValue((Attribute) fvWekaAttributes1.elementAt(0),
                    "burger here is good");
            in.setValue((Attribute) fvWekaAttributes1.elementAt(1), 0);

            Instances testSet = new Instances("dataset", fvWekaAttributes1, 1);
            in.setDataset(testSet);

            double[] fDistribution = model.distributionForInstance(in);
            System.out.println(Arrays.toString(fDistribution));

        }

    }

The problem I am facing is that, for any input, the output distribution is always around [0.52314376998377, 0.47685623001622995]. It always leans slightly toward positive over negative, and the numbers barely change. Any idea what I am doing wrong?
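For context, a near-constant distribution like this is what Naive Bayes reports when none of the test instance's features line up with the attributes the model was trained on: the likelihood terms contribute equally to every class, so the posterior collapses to the class priors. A toy, Weka-free illustration of that collapse (the class balance and likelihood values here are made up for the example):

```java
public class PriorCollapse {

    // Posterior ∝ prior × Π likelihood(feature | class), renormalized.
    // If every feature is equally likely under both classes (i.e. the
    // features carry no information), the posterior equals the prior.
    static double[] posterior(double[] priors, double[][] likelihoods) {
        double[] post = new double[priors.length];
        double norm = 0;
        for (int c = 0; c < priors.length; c++) {
            post[c] = priors[c];
            for (double[] l : likelihoods) {
                post[c] *= l[c];
            }
            norm += post[c];
        }
        for (int c = 0; c < post.length; c++) {
            post[c] /= norm;
        }
        return post;
    }

    public static void main(String[] args) {
        double[] priors = {0.52, 0.48}; // hypothetical class balance
        // Each row is one feature's likelihood under (positive, negative);
        // identical columns mean the feature distinguishes nothing.
        double[][] uninformative = {{0.5, 0.5}, {0.2, 0.2}};
        double[] post = posterior(priors, uninformative);
        System.out.println(post[0] + " " + post[1]); // the priors, unchanged
    }
}
```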

1 Answer:

Answer 0 (score: 0)

I have not read your code, but one thing I can say is that AFINN scores are normalized within a fixed range. If your output skews toward the positive range, you need to change the classification cost function, because it is overfitting your data.
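On the normalization point: AFINN-111 assigns each word an integer valence from -5 to +5, so the per-tip sum computed in the question grows with the number of matched words. One way to keep the affinScore attribute in a fixed range is to average over matched words and clamp; a sketch of that idea (the divisor 5 and the clamping policy are choices for illustration, not anything AFINN prescribes):

```java
public class ScoreNormalizer {

    // Rescale a raw AFINN sum into [-1, 1]: average over the matched
    // words, divide by the per-word maximum of 5, clamp at the ends.
    static double normalize(int rawSum, int matchedWords) {
        if (matchedWords == 0) {
            return 0.0; // no sentiment words found: treat as neutral
        }
        double avg = (double) rawSum / matchedWords / 5.0;
        return Math.max(-1.0, Math.min(1.0, avg));
    }

    public static void main(String[] args) {
        System.out.println(normalize(3, 1));   // one word scored +3 -> 0.6
        System.out.println(normalize(-8, 2));  // two words summing -8 -> -0.8
        System.out.println(normalize(0, 0));   // nothing matched -> 0.0
    }
}
```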