I want to create a Python script using NLTK, or preferably any library, that can correctly identify whether a given sentence is interrogative (a question) or not. I tried using regexes, but there are deeper cases where regexes fail, so I would like to use natural language processing instead. Can anyone help?
Answer 0 (score: 2)
This may solve your problem. Here is the code:
import nltk

# train a dialogue-act classifier on the NPS Chat corpus
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    # bag-of-words features: one boolean per lowercased token
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# hold out the first 10% for testing, train on the remaining 90%
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
This should print an accuracy of around 0.67. If you want to run a string of text through the classifier, try:
print(classifier.classify(dialogue_act_features(line)))
It will classify the string as ynQuestion, Statement, etc., and you can extract whatever you need from that.
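For example, a minimal sketch (the sample sentences are hypothetical, and the exact labels depend on the trained model):

for line in ["What time is it?", "It is sunny today."]:
    # print the predicted dialogue-act class, e.g. whQuestion or Statement
    print(line, '->', classifier.classify(dialogue_act_features(line)))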
This approach uses Naive Bayes, which in my opinion is the simplest, but there are certainly many ways to tackle this problem. Hope this helps!
Answer 1 (score: 1)
Building on @PolkaDot's answer, I created a function that uses NLTK, then added some custom code on top of it to get higher accuracy.
import nltk

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]

# 10% of the total data
size = int(len(featuresets) * 0.1)

# first 10% for test_set to check the accuracy, and the remaining 90% for training
train_set, test_set = featuresets[size:], featuresets[:size]

# get the classifier from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))

question_types = ["whQuestion", "ynQuestion"]

def is_ques_using_nltk(ques):
    # treat both wh- and yes/no question labels as questions
    question_type = classifier.classify(dialogue_act_features(ques))
    return question_type in question_types
Then:
question_pattern = ["do i", "do you", "what", "who", "is it", "why", "would you", "how", "is there",
                    "are there", "is it so", "is this true", "to know", "is that true", "are we", "am i",
                    "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                    "question", "answer", "questions", "answers", "ask"]

helping_verbs = ["is", "am", "can", "are", "do", "does"]

# if the classifier does not mark it as a question, run a custom pattern check
def is_question(question):
    question = question.lower().strip()
    if not is_ques_using_nltk(question):
        is_ques = False
        # check if any of the patterns occur in the sentence
        for pattern in question_pattern:
            is_ques = pattern in question
            if is_ques:
                break

        # there could be multiple sentences, so split on periods
        sentence_arr = question.split(".")
        for sentence in sentence_arr:
            if len(sentence.strip()):
                # a sentence that ends with "?" or starts with a helping verb is a question
                # word_tokenize strips whitespace by default
                first_word = nltk.word_tokenize(sentence)[0]
                if sentence.endswith("?") or first_word in helping_verbs:
                    is_ques = True
                    break
        return is_ques
    else:
        return True
You then just need to call the is_question function to check whether the sentence passed in is a question.
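For example, a quick sketch with hypothetical inputs (the second result depends on the trained Naive Bayes model, since the pattern checks only run when the classifier says it is not a question):

print(is_question("can you explain how this works"))  # True: matches the "can you explain" pattern
print(is_question("i will be there at noon"))         # typically False: no pattern, no "?", no leading helping verb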
Answer 2 (score: 0)
You can improve on PolkaDot's solution and reach about 86% accuracy by using the sklearn library with simple Gradient Boosting. That would look something like this:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()
posts_text = [post.text for post in posts]

# divide train and test 80/20
train_text = posts_text[:int(len(posts_text)*0.8)]
test_text = posts_text[int(len(posts_text)*0.8):]

# get TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                             min_df=0.001,
                             max_df=0.7,
                             analyzer='word')

X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

y = [post.get('class') for post in posts]
y_train = y[:int(len(posts_text)*0.8)]
y_test = y[int(len(posts_text)*0.8):]

# fit a Gradient Boosting classifier to the training set
gb = GradientBoostingClassifier(n_estimators=400, random_state=0)
# could be improved with cross-validation
gb.fit(X_train, y_train)

predictions_rf = gb.predict(X_test)

# accuracy of about 86% -- not bad
print(classification_report(y_test, predictions_rf))
You can then use the model to make predictions on new data with gb.predict(vectorizer.transform(['new sentence here'])).
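A minimal sketch of that, reusing the fitted vectorizer and model from above (the sentences are hypothetical examples):

new_sentences = ["how does this work", "the sky is blue today"]
# vectorizer.transform reuses the vocabulary learned during fit_transform
print(gb.predict(vectorizer.transform(new_sentences)))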