Train the model first, then test it multiple times

Time: 2019-12-16 16:23:30

Tags: python command-line nlp

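I have been trying to use a Python NLP script from a Qt GUI based C++ application. In the application, I run the NLP script through the command line using QProcess (the C++ snippet appears further down).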

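The above works fine. The problem is that each run takes about 40-50 seconds, because the script first trains the model and then tests it. I would like to train the model once and then test it many times, the way I can in a Jupyter Notebook. To do that, I wrote a separate function for testing and tried to call it from the command line: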

  

PS D:\DS Project\Treegramming> py nlp.py "test('it is really amazing')"

However, this again runs the entire script first and only then calls the function. Is there any way to solve this?

The C++ code that launches the Python script:

QString path = "D:/DS Project/Treegramming";
QString  command("py");
QStringList params = QStringList() << "nlp.py";
params << text;
QProcess *process = new QProcess();
process->setWorkingDirectory(path);
process->start(command, params);
process->waitForFinished();
QString result = process->readAll();

1 Answer:

Answer 0: (score: 1)

You need to create two Python scripts:

  • The first trains the NaiveBayesClassifier and saves it.
  • The second loads the saved model and tests it.
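
The split into two scripts rests on writing the trained object to disk once and reading it back on every later run. The save-and-load round trip can be sketched with Python's pickle module as below. This is only an illustration: the toy_model dictionary, the helper names save_model and load_model, and the file name toy_model.pickle are hypothetical stand-ins, not part of the scripts in this answer.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    # Done once, by the training script: write the model to disk.
    with open(path, "wb") as fout:
        pickle.dump(model, fout)


def load_model(path):
    # Done on every test run: read the model back without retraining.
    with open(path, "rb") as fin:
        return pickle.load(fin)


# Hypothetical stand-in for a trained classifier: a plain word-to-label table.
toy_model = {"amazing": "Positive", "awful": "Negative"}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pickle")
    save_model(toy_model, model_path)
    restored = load_model(model_path)

print(restored["amazing"])
```

Loading a pickle is usually far cheaper than retraining, which is what removes most of the per-run delay. Only unpickle files you created yourself, since pickle can execute code while loading.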

To avoid duplicating code, I would put the helper functions in a separate script named utils.py, which should look like this:

import re
import string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    """Lemmatize each token using its part-of-speech tag."""
    sentence = []
    lemmatizer = WordNetLemmatizer()
    for word, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to a WordNet part of speech.
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        sentence.append(lemmatizer.lemmatize(word, pos))
    return sentence

def remove_noise(tokens, stop_words=()):
    """Strip URLs and @mentions, lemmatize, and drop punctuation and stop words."""
    sentence = []
    lemmatizer = WordNetLemmatizer()  # create once instead of once per token
    for token, tag in pos_tag(tokens):
        # Remove URLs and Twitter @mentions (raw strings avoid invalid-escape warnings).
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the Penn Treebank tag to a WordNet part of speech.
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            sentence.append(token.lower())
    return sentence

def get_all_words(tokens_list):
    for tokens in tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(tokens_list):
    for tweets in tokens_list:
        yield dict([token,True] for token in tweets)
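
To make the last two helpers concrete, here is a small sketch of what they produce. The two functions are repeated so the snippet runs on its own without NLTK, and the cleaned list is made-up data for illustration only.

```python
def get_all_words(tokens_list):
    # Flatten a list of token lists into one stream of tokens.
    for tokens in tokens_list:
        for token in tokens:
            yield token


def get_tweets_for_model(tokens_list):
    # Turn each tweet into a {token: True} feature dictionary.
    for tweets in tokens_list:
        yield dict([token, True] for token in tweets)


# Hypothetical, already-cleaned tweets (made-up data).
cleaned = [["great", "day"], ["great", "movie"]]

all_words = list(get_all_words(cleaned))
features = list(get_tweets_for_model(cleaned))

print(all_words)
print(features)
```

The feature dictionaries are what get paired with a "Positive" or "Negative" label in train.py before being passed to NaiveBayesClassifier.train.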


Then create the training script, which I named train.py. It should look like this:

import random
import pickle
from utils import *
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk.corpus import twitter_samples


positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

stop_words = stopwords.words('english')

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

all_pos_words = get_all_words( positive_cleaned_tokens_list )
all_neg_words = get_all_words( negative_cleaned_tokens_list )

freq_dis_pos = FreqDist( all_pos_words )
freq_dis_neg = FreqDist( all_neg_words )

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

pos_dataset = [(tweets,"Positive") for tweets in positive_tokens_for_model]
neg_dataset = [(tweets,"Negative") for tweets in negative_tokens_for_model]

dataset = pos_dataset + neg_dataset
random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

classifier = NaiveBayesClassifier.train(train_data)

#### ADD THESE TO SAVE THE CLASSIFIER ####
with open("model.pickle", "wb") as fout:
    pickle.dump(classifier, fout)

Finally, the testing script test.py should look like this:

import sys
import pickle
from nltk import classify
from nltk.tokenize import word_tokenize

from utils import remove_noise

#### ADD THESE TO LOAD THE CLASSIFIER ####
with open('model.pickle', 'rb') as fin:
    classifier = pickle.load(fin)


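def test(custom_tweet):
    """Classify one tweet, print the label, and also write it to result.txt."""
    custom_tokens = remove_noise(word_tokenize(custom_tweet))
    res = classifier.classify(dict([token, True] for token in custom_tokens))
    print(res)
    with open("result.txt", "w") as f:
        f.write(res)

# The argument is evaluated as Python code, so call the script as:
#   py test.py "test('some tweet')"
# Only do this with input you trust, since eval runs whatever it is given.
eval(sys.argv[1])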
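
Because the C++ code in the question passes the raw text as the script argument, an alternative to eval is to read the tweet directly from the command line. The following is a minimal sketch of that idea using argparse. The names parse_tweet and classify_stub are hypothetical; classify_stub merely stands in for the test function so the snippet runs without the trained model.

```python
import argparse


def parse_tweet(argv):
    # Read the tweet text as a single positional command-line argument.
    parser = argparse.ArgumentParser(description="Classify one tweet.")
    parser.add_argument("tweet", help="text of the tweet to classify")
    return parser.parse_args(argv).tweet


def classify_stub(tweet):
    # Hypothetical placeholder for test(); the real script would call the
    # unpickled classifier here.
    return "Positive" if "amazing" in tweet.lower() else "Negative"


# In a real script this would be parse_tweet(sys.argv[1:]).
tweet = parse_tweet(["It was amazing"])
print(classify_stub(tweet))
```

With this variant the script would be called as py test.py "It was amazing", and no user-supplied text is ever executed as code.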

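Now run train.py once to train the Naive Bayes classifier; this creates a new file named model.pickle containing the trained classifier. Then run test.py from the C++ application on each custom tweet. test.py loads the trained model from model.pickle and applies it to the given tweet.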
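
Starting test.py once per tweet still pays for interpreter start-up and model loading on every call. One possible variation, which is not part of the answer above, is to keep a single Python process alive and feed it one tweet per line on standard input. The sketch below shows only the loop; serve and classify_stub are hypothetical names, and the in-memory streams stand in for sys.stdin and sys.stdout.

```python
import io


def serve(classify, lines_in, lines_out):
    """Classify one tweet per input line until the input ends."""
    for line in lines_in:
        tweet = line.strip()
        if not tweet:
            continue  # skip blank lines
        lines_out.write(classify(tweet) + "\n")
        lines_out.flush()  # let the parent process see each result promptly


def classify_stub(tweet):
    # Hypothetical placeholder; a real script would load model.pickle once
    # before the loop and call the classifier here.
    return "Positive" if "amazing" in tweet.lower() else "Negative"


# In-memory streams stand in for sys.stdin and sys.stdout.
fake_stdin = io.StringIO("It was amazing\n\nIt was dull\n")
fake_stdout = io.StringIO()
serve(classify_stub, fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")
```

On the Qt side, this would mean starting the QProcess once, writing each tweet followed by a newline, and reading back one line per result, rather than launching a new process each time.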