Mallet ML library prints the same result for different instances

Time: 2017-07-04 01:20:36

Tags: java machine-learning nlp classification mallet

I would like to know why the Mallet classification model gives the same output even though my instances are completely different from one another.

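I changed the code in Csv2Classify so that it prints only the top 10 labels and their confidence scores. I also print each instance's data so I can check whether everything is working. However, I don't think the code is the problem, because Mallet seems to classify most instances with the same labels. Still, here is the code I changed in Csv2Classify: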

1. Get the locations with the highest confidence:

public static int[] getTopLocations(Labeling labeling, int numberOfCategories) {
    double[] values = new double[labeling.numLocations()];
    for (int location = 0; location < labeling.numLocations(); location++) {
        values[location] = labeling.valueAtLocation(location);          
    }
    int[] outputLocations = indexesOfTopElements(values, numberOfCategories);
    return outputLocations;
}

private static int[] indexesOfTopElements(double[] orig, int nummax) {
    double[] copy = Arrays.copyOf(orig,orig.length);
    Arrays.sort(copy);
    double[] honey = Arrays.copyOfRange(copy,copy.length - nummax, copy.length);
    int[] result = new int[nummax];
    int resultPos = 0;
    for(int i = 0; i < orig.length; i++) {
        double onTrial = orig[i];
        int index = Arrays.binarySearch(honey,onTrial);
        if(index < 0) continue;
        result[resultPos++] = i;
    }
    return result;
}
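
A side note on the helper above: because it matches values with Arrays.binarySearch, any value tied with the cutoff produces extra matches, so `result` can overflow (for example, three equal top values with nummax = 2), and a nummax larger than the array makes Arrays.copyOfRange throw. The indexes also come back in location order rather than by confidence. Below is a minimal sketch of a tie-safe variant using only the JDK. The class name TopIndexes is hypothetical, and unlike the original it returns indexes ordered from highest to lowest value.

```java
import java.util.Arrays;

class TopIndexes {

    /**
     * Returns the indexes of the k largest values, ordered from the
     * largest value to the smallest. If k exceeds the array length,
     * all indexes are returned. Ties keep ascending index order.
     */
    static int[] indexesOfTopElements(double[] values, int k) {
        int limit = Math.max(0, Math.min(k, values.length));

        // Sort boxed indexes by the value they point at, descending.
        // Arrays.sort on objects is stable, so ties keep index order.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[b], values[a]));

        int[] result = new int[limit];
        for (int i = 0; i < limit; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.5, 0.5, 0.5, 0.1};
        // Three tied top values with k = 2 no longer overflow the result.
        System.out.println(Arrays.toString(indexesOfTopElements(scores, 2)));
    }
}
```

Sorting all indexes costs O(n log n), which should be fine for a few thousand labels; it does not change what the classifier returns, only which locations get printed.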

2. Classify and print the instance data, to check whether the statistical model is working correctly:

    public static void main (String[] args) throws FileNotFoundException, IOException {

        // Process the command-line options
        CommandOption.setSummary (Csv2Classify.class,
                                  "A tool for classifying a stream of unlabeled instances");
        CommandOption.process (Csv2Classify.class, args);

        // Print some helpful messages for error cases
        if (args.length == 0) {
            CommandOption.getList(Csv2Classify.class).printUsage(false);
            System.exit (-1);
        }
        if (inputFile == null) {
            throw new IllegalArgumentException ("You must include `--input FILE ...' in order to specify a "+
                                "file containing the instances, one per line.");
        }

      // Read classifier from file
        Classifier classifier = null;
        try {
            ObjectInputStream ois =
                new ObjectInputStream (new BufferedInputStream(new FileInputStream (classifierFile.value)));

            classifier = (Classifier) ois.readObject();
            ois.close();
        } catch (Exception e) {
            throw new IllegalArgumentException("Problem loading classifier from file " + classifierFile.value +
                               ": " + e.getMessage());
        }

        // Read instances from the file
        Reader fileReader;
        if (inputFile.value.toString().equals ("-")) {
            fileReader = new InputStreamReader (System.in);
        }
        else {
            fileReader = new InputStreamReader(new FileInputStream(inputFile.value), encoding.value);
        }
        Iterator<Instance> csvIterator =
            new CsvIterator (fileReader, Pattern.compile(lineRegex.value),
            dataOption.value, 0, nameOption.value);
        Iterator<Instance> iterator =
            classifier.getInstancePipe().newIteratorFrom(csvIterator);

        // Write classifications to the output file
        PrintStream out = null;

        if (outputFile.value.toString().equals ("-")) {
            out = System.out;
        }
        else {
            out = new PrintStream(outputFile.value, encoding.value);
        }

        // gdruck@cs.umass.edu
        // Stop growth on the alphabets. If this is not done and new
        // features are added, the feature and classifier parameter
        // indices will not match.
        classifier.getInstancePipe().getDataAlphabet().stopGrowth();
        classifier.getInstancePipe().getTargetAlphabet().stopGrowth();

        while (iterator.hasNext()) {
            Instance instance = iterator.next();

            Labeling labeling =
                classifier.classify(instance).getLabeling();

            StringBuilder output = new StringBuilder();
            output.append(instance.getName() + "\n");                       
            output.append("\t" + instance.getData() + "\n");
            int[] topLocations = Csv2Classify.getTopLocations(labeling, 10);
            for (int index = 0; index < topLocations.length; index++) {
                int location = topLocations[index];
                System.out.print("location printed:" + location + "\n");
                output.append("\t" + labeling.labelAtLocation(location));
                output.append("\t" + labeling.valueAtLocation(location));
            }
            output.append("\n");
            out.println(output);
        }

        if (! outputFile.value.toString().equals ("-")) {
            out.close();
        }
    }
}   

I trained a decision tree with 50 trials. The command I then used for classification is:

bin/mallet classify-file --input data --output classification.output --classifier decision_tree.classifier

Source file:

10914642 sky room business office people young teamwork success professional monitor contemporary entrepreneur customer idea businesspeople window adult person pc indoor guy worker confident attractive operator male job career communication handsome book chair businessman manager work table meeting workplace cloud headset caucasian man lamp executive successful corporate occupation concept
13209539 performance industry panel results photovoltaic technology energy security_helmet nature running eco-friendly ecology sky android touchpad function environment businessman plant solar_panel architecture man analysis worker renewable_energy wireless business electronic_tablet solarium construction engineering cell operation electricity power light sensor electric collector durable setup senior alternative installation solarization checking touchscreen engineer
26375762 building hat occupation expert plan professional engineering confident architector business businessman designer helmet engineer man hardhat architect construction worker suit executive work builder
26780099 desk male sitting technology headphone arabian laptop flare office resting business startup morning job person work casual eastern men creative indoors businessman workplace relax play lifestyle worker phone beard sunset arab sunrise computer professional sun break effect handsome young leisure legs happy hipster people
26783548 lifestyle manager male elegance use one adult work executive cellphone intelligence notebook business laptop technology entrepreneur middle-aged modern smart busy communication job businessman contemporary occupation corporate urban smile man message senior mobile caucasian city computer hold expertise suit professional wireless restaurant internet look break worker phone cafe table smartphone sit
26783561 elegant intelligent caucasian read urban friendly coffee laptop table hold businessman folder sit adult cafeteria computer concentration senior informed lifestyle restaurant male smartphone confident drink look worker one entrepreneur break business middle-aged corporate job cup smart professional work cafe successful technology modern city executive document serious paper man contemporary beard
26958424 serious male formal connection vision city executive mature worker businessman professional phone glasses aspirations call confidence street confident communication corporate feminism successful outdoors solution device lifestyle guy calling standing smart adult caucasian business shirt technology success person suit outside entrepreneur tie urban building man gadget thoughtful smartphone
27207487 leisure window business resting mustache networking beard lonely rest hobby relaxation thinking break lifestyle pool_table mobile_phone sitting room style alone man businessman pool connection locker technology relax
27210236 information workplace development office plan organization research entrepreneur office_worker meeting discussion businessman business_people interaction corporate operations analysis busy strategic process strategy workspace cooperation communication white_collar_worker talking statistics enterpriser corporate_business motivation global_finance investment objective global_market working mission business collaboration thinking global_business planning solution vision tactics marketing
27344048 businessman brick alone management telecommunication planning plan connection working talking internet computer white_collar_worker workspace technology research startup mobile_phone office business window brick_wall laptop strategy place_of_work online on_the_phone man thinking wireless digital_device workplace business_person communication

Result file:

10914642
    person(62)=1.0
job(121)=1.0
occupation(128)=1.0
work(159)=1.0
man(195)=1.0
male(203)=1.0
attractive(204)=1.0
handsome(206)=1.0
confident(209)=1.0
worker(210)=1.0
adult(220)=1.0
caucasian(222)=1.0
young(238)=1.0
people(239)=1.0
professional(327)=1.0
guy(354)=1.0
business(369)=1.0
successful(370)=1.0
executive(371)=1.0
success(376)=1.0
office(379)=1.0
entrepreneur(382)=1.0
corporate(390)=1.0
businesspeople(392)=1.0
businessman(395)=1.0
table(443)=1.0
manager(506)=1.0
meeting(520)=1.0
communication(560)=1.0
teamwork(579)=1.0
indoor(615)=1.0
room(622)=1.0
idea(729)=1.0
workplace(737)=1.0
contemporary(740)=1.0
career(798)=1.0
lamp(829)=1.0
concept(850)=1.0
window(924)=1.0
book(1216)=1.0
cloud(1318)=1.0
sky(1333)=1.0
headset(1595)=1.0
chair(1808)=1.0
customer(2171)=1.0
operator(2345)=1.0
monitor(2741)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

13209539
    nature(55)=1.0
man(195)=1.0
worker(210)=1.0
industry(218)=1.0
construction(232)=1.0
business(369)=1.0
businessman(395)=1.0
environment(550)=1.0
light(552)=1.0
power(567)=1.0
engineering(617)=1.0
engineer(630)=1.0
senior(653)=1.0
performance(669)=1.0
energy(790)=1.0
electric(1001)=1.0
electricity(1007)=1.0
checking(1295)=1.0
sky(1333)=1.0
installation(1350)=1.0
technology(1395)=1.0
wireless(1461)=1.0
analysis(1797)=1.0
plant(2419)=1.0
alternative(2693)=1.0
results(2748)=1.0
architecture(3206)=1.0
android(3253)=1.0
touchpad(3256)=1.0
running(3294)=1.0
panel(3598)=1.0
photovoltaic(3599)=1.0
security_helmet(3600)=1.0
eco-friendly(3601)=1.0
ecology(3602)=1.0
function(3603)=1.0
solar_panel(3604)=1.0
renewable_energy(3605)=1.0
electronic_tablet(3606)=1.0
solarium(3607)=1.0
cell(3608)=1.0
operation(3609)=1.0
sensor(3610)=1.0
collector(3611)=1.0
durable(3612)=1.0
setup(3613)=1.0
solarization(3614)=1.0
touchscreen(3615)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

26375762
    occupation(128)=1.0
work(159)=1.0
man(195)=1.0
hat(208)=1.0
confident(209)=1.0
worker(210)=1.0
hardhat(223)=1.0
construction(232)=1.0
professional(327)=1.0
plan(346)=1.0
business(369)=1.0
executive(371)=1.0
suit(377)=1.0
businessman(395)=1.0
building(542)=1.0
engineering(617)=1.0
engineer(630)=1.0
helmet(634)=1.0
builder(1017)=1.0
designer(2675)=1.0
expert(3458)=1.0
architect(4485)=1.0
architector(7523)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

26780099
    person(62)=1.0
job(121)=1.0
work(159)=1.0
male(203)=1.0
handsome(206)=1.0
worker(210)=1.0
lifestyle(230)=1.0
young(238)=1.0
people(239)=1.0
men(241)=1.0
eastern(320)=1.0
happy(323)=1.0
professional(327)=1.0
play(364)=1.0
business(369)=1.0
office(379)=1.0
indoors(389)=1.0
businessman(395)=1.0
laptop(512)=1.0
computer(517)=1.0
sitting(588)=1.0
casual(631)=1.0
workplace(737)=1.0
leisure(783)=1.0
phone(804)=1.0
beard(871)=1.0
desk(928)=1.0
sun(1329)=1.0
technology(1395)=1.0
headphone(1639)=1.0
creative(2040)=1.0
relax(2166)=1.0
break(2247)=1.0
legs(2256)=1.0
morning(2415)=1.0
sunset(2538)=1.0
hipster(3302)=1.0
sunrise(3325)=1.0
arabian(3671)=1.0
startup(3699)=1.0
effect(4314)=1.0
resting(5592)=1.0
flare(7660)=1.0
arab(7661)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

26783548
    job(121)=1.0
occupation(128)=1.0
smile(150)=1.0
work(159)=1.0
man(195)=1.0
one(199)=1.0
male(203)=1.0
worker(210)=1.0
adult(220)=1.0
caucasian(222)=1.0
lifestyle(230)=1.0
professional(327)=1.0
business(369)=1.0
executive(371)=1.0
suit(377)=1.0
entrepreneur(382)=1.0
corporate(390)=1.0
businessman(395)=1.0
table(443)=1.0
restaurant(448)=1.0
manager(506)=1.0
laptop(512)=1.0
computer(517)=1.0
modern(540)=1.0
look(557)=1.0
communication(560)=1.0
message(574)=1.0
elegance(644)=1.0
senior(653)=1.0
smart(732)=1.0
contemporary(740)=1.0
busy(746)=1.0
expertise(775)=1.0
phone(804)=1.0
mobile(881)=1.0
cellphone(1223)=1.0
hold(1237)=1.0
sit(1364)=1.0
technology(1395)=1.0
wireless(1461)=1.0
use(1645)=1.0
internet(1805)=1.0
notebook(1809)=1.0
city(1865)=1.0
smartphone(2216)=1.0
break(2247)=1.0
urban(3021)=1.0
middle-aged(3492)=1.0
cafe(4856)=1.0
intelligence(4943)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

26783561
    drink(109)=1.0
job(121)=1.0
work(159)=1.0
man(195)=1.0
one(199)=1.0
male(203)=1.0
confident(209)=1.0
worker(210)=1.0
adult(220)=1.0
caucasian(222)=1.0
serious(224)=1.0
lifestyle(230)=1.0
professional(327)=1.0
friendly(329)=1.0
business(369)=1.0
successful(370)=1.0
executive(371)=1.0
entrepreneur(382)=1.0
corporate(390)=1.0
businessman(395)=1.0
table(443)=1.0
restaurant(448)=1.0
elegant(455)=1.0
cup(485)=1.0
laptop(512)=1.0
computer(517)=1.0
folder(523)=1.0
modern(540)=1.0
look(557)=1.0
senior(653)=1.0
smart(732)=1.0
contemporary(740)=1.0
document(744)=1.0
paper(745)=1.0
beard(871)=1.0
coffee(947)=1.0
hold(1237)=1.0
sit(1364)=1.0
read(1372)=1.0
technology(1395)=1.0
concentration(1428)=1.0
city(1865)=1.0
smartphone(2216)=1.0
break(2247)=1.0
urban(3021)=1.0
middle-aged(3492)=1.0
cafeteria(4134)=1.0
cafe(4856)=1.0
intelligent(5655)=1.0
informed(7666)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

26958424
    person(62)=1.0
confidence(194)=1.0
man(195)=1.0
male(203)=1.0
confident(209)=1.0
worker(210)=1.0
adult(220)=1.0
caucasian(222)=1.0
serious(224)=1.0
outdoors(227)=1.0
lifestyle(230)=1.0
standing(236)=1.0
professional(327)=1.0
guy(354)=1.0
tie(367)=1.0
business(369)=1.0
successful(370)=1.0
executive(371)=1.0
success(376)=1.0
suit(377)=1.0
entrepreneur(382)=1.0
corporate(390)=1.0
businessman(395)=1.0
glasses(456)=1.0
building(542)=1.0
mature(553)=1.0
communication(560)=1.0
smart(732)=1.0
formal(733)=1.0
outside(784)=1.0
phone(804)=1.0
shirt(876)=1.0
solution(1021)=1.0
technology(1395)=1.0
connection(1466)=1.0
city(1865)=1.0
smartphone(2216)=1.0
vision(2313)=1.0
call(2347)=1.0
calling(2491)=1.0
device(2574)=1.0
urban(3021)=1.0
thoughtful(3358)=1.0
street(3428)=1.0
gadget(4981)=1.0
feminism(7584)=1.0
aspirations(7741)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

27207487
    man(195)=1.0
lifestyle(230)=1.0
style(306)=1.0
business(369)=1.0
businessman(395)=1.0
sitting(588)=1.0
room(622)=1.0
leisure(783)=1.0
beard(871)=1.0
window(924)=1.0
technology(1395)=1.0
connection(1466)=1.0
thinking(1487)=1.0
alone(1978)=1.0
relax(2166)=1.0
break(2247)=1.0
mobile_phone(2284)=1.0
mustache(2332)=1.0
hobby(2524)=1.0
relaxation(2702)=1.0
networking(3698)=1.0
pool(5219)=1.0
resting(5592)=1.0
rest(5768)=1.0
lonely(7803)=1.0
pool_table(7804)=1.0
locker(7805)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

27210236
    information(41)=1.0
working(205)=1.0
plan(346)=1.0
business(369)=1.0
office(379)=1.0
entrepreneur(382)=1.0
corporate(390)=1.0
businessman(395)=1.0
meeting(520)=1.0
cooperation(524)=1.0
communication(560)=1.0
discussion(600)=1.0
collaboration(718)=1.0
interaction(734)=1.0
workplace(737)=1.0
busy(746)=1.0
planning(759)=1.0
strategy(760)=1.0
talking(781)=1.0
solution(1021)=1.0
research(1081)=1.0
business_people(1222)=1.0
thinking(1487)=1.0
development(1596)=1.0
analysis(1797)=1.0
mission(1887)=1.0
global_business(2246)=1.0
office_worker(2274)=1.0
vision(2313)=1.0
tactics(2586)=1.0
investment(3004)=1.0
marketing(3197)=1.0
organization(3801)=1.0
motivation(3802)=1.0
operations(4660)=1.0
white_collar_worker(4840)=1.0
process(4983)=1.0
corporate_business(5542)=1.0
workspace(5706)=1.0
enterpriser(6814)=1.0
strategic(7806)=1.0
statistics(7807)=1.0
global_finance(7808)=1.0
objective(7809)=1.0
global_market(7810)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103

27344048
    man(195)=1.0
working(205)=1.0
plan(346)=1.0
business(369)=1.0
office(379)=1.0
businessman(395)=1.0
business_person(507)=1.0
place_of_work(510)=1.0
laptop(512)=1.0
computer(517)=1.0
communication(560)=1.0
workplace(737)=1.0
planning(759)=1.0
strategy(760)=1.0
talking(781)=1.0
window(924)=1.0
research(1081)=1.0
technology(1395)=1.0
wireless(1461)=1.0
connection(1466)=1.0
thinking(1487)=1.0
online(1804)=1.0
internet(1805)=1.0
alone(1978)=1.0
mobile_phone(2284)=1.0
management(2985)=1.0
startup(3699)=1.0
digital_device(4572)=1.0
white_collar_worker(4840)=1.0
brick(5553)=1.0
workspace(5706)=1.0
telecommunication(7883)=1.0
brick_wall(7884)=1.0
on_the_phone(7885)=1.0

    9466    0.20320855614973263 9467    0.10160427807486631 9505    0.0427807486631016  9514    0.016042780748663103    9468    0.053475935828877004    9462    0.13903743315508021 9463    0.13368983957219252 9460    0.22994652406417113 9486    0.0374331550802139  9506    0.016042780748663103



    9480    0.03642671292281006 9496    0.033824804856895055    9481    0.03469210754553339 9499    0.03555941023417172 9491    0.03295750216825672 9517    0.03816131830008673 9501    0.03469210754553339 9492    0.03469210754553339 9478    0.03469210754553339 9506    0.03469210754553339
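
One observation from the numbers themselves, offered as a sanity check rather than a diagnosis: the ten scores repeated for every instance ID above all appear to be integer multiples of 1/187 (for example, 0.20320855614973263 is 38/187). The denominator 187 was found by inspection and is not a value Mallet reports. The sketch below checks that claim with plain JDK code; the class and method names are hypothetical.

```java
class ScoreFractionCheck {

    /**
     * Returns true if every score is within tol of an integer
     * multiple of 1/denominator.
     */
    static boolean allMultiplesOf(double[] scores, int denominator, double tol) {
        for (double s : scores) {
            double scaled = s * denominator;
            if (Math.abs(scaled - Math.rint(scaled)) > tol) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The ten scores printed for every instance in the result file.
        double[] scores = {
            0.20320855614973263, 0.10160427807486631, 0.0427807486631016,
            0.016042780748663103, 0.053475935828877004, 0.13903743315508021,
            0.13368983957219252, 0.22994652406417113, 0.0374331550802139,
            0.016042780748663103
        };
        // 187 is an assumption found by inspection, not a Mallet output.
        System.out.println(allMultiplesOf(scores, 187, 1e-9));
    }
}
```

If the check holds, the scores look like label counts divided by a fixed total, which would be consistent with every instance receiving the same stored distribution. Whether that is actually what the classifier does here is unconfirmed.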

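I am also including links to my training data and classifier file, in case I made a mistake while parsing the training data or during training: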

Any help is greatly appreciated.

0 answers:

No answers