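I have a dataset of hotel reviews and want to predict whether each review is positive or negative. However, the dataset has no dependent variable y. I am planning to use NLTK and the Naive Bayes algorithm. Please help me solve this problem. Here is my code so far.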
Reviews = dataset.iloc[:,18]
#print(Reviews)
#Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
for num in range(0, 10000):
    # Keep letters only, then lowercase and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', str(Reviews[num]))
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
print(corpus)
#Creating the Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X)
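Answer 0 (score: 1)
Since you have no target class (dependent variable y), I think you should consider an unsupervised learning approach such as clustering.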
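As a rough sketch of the clustering idea, the snippet below groups a few reviews with k-means on TF-IDF features. The `sample_corpus` list is a made-up stand-in for the cleaned `corpus` from the question, and the choice of two clusters and of `TfidfVectorizer` over `CountVectorizer` are assumptions for illustration, not requirements.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned reviews
sample_corpus = [
    "great room friendly staff",
    "friendly staff great breakfast",
    "dirty room rude staff",
    "rude staff dirty bathroom",
]

# Turn each review into a TF-IDF weighted bag-of-words vector
vectorizer = TfidfVectorizer()
X_sample = vectorizer.fit_transform(sample_corpus)

# Ask for two groups; random_state makes the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_sample)

for text, cluster_id in zip(sample_corpus, cluster_ids):
    print(cluster_id, text)
```

Keep in mind that k-means only groups similar texts. Nothing guarantees the two clusters line up with positive and negative sentiment, so you would still need to read a sample from each cluster to decide what it represents.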
Answer 1 (score: 0)
If you do not have a target variable, you can try TextBlob, which returns a sentiment score without any training data:
from textblob import TextBlob
testimonial = TextBlob("today is a bad day for me!")
print(testimonial.sentiment)
# Output (polarity close to 1 means positive, close to -1 means negative):
# Sentiment(polarity=-0.8749999999999998, subjectivity=0.6666666666666666)
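One way to connect this back to the original question is to turn polarity scores into provisional labels, which could then serve as the missing y for a Naive Bayes classifier. The sketch below assumes a simple cutoff at 0; the helper name `label_from_polarity` and the `threshold` parameter are hypothetical, and only the first score comes from the example above (the others are invented for illustration).

```python
def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a label: 1 = positive, 0 = negative."""
    return 1 if polarity > threshold else 0


# First value is the polarity printed above; the rest are made-up examples
polarities = [-0.8749999999999998, 0.5, 0.0]

# Provisional labels that could be used as y when training a classifier
y = [label_from_polarity(p) for p in polarities]
print(y)  # [0, 1, 0]
```

Labels produced this way are only as good as the lexicon behind the polarity score, so it is worth checking a sample by hand before training on them.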