I am writing Python code that uses natural language processing to analyze a dataset and validate Twitter updates. My random forest model works perfectly.
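import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

dataset = pd.read_csv('bully.txt', delimiter='\t', quoting=3)
corpus = []
for i in range(0, 8576):
    # Keep letters only, lowercase, and split on whitespace
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    # Stem each token and drop English stop words
    tweet = [ps.stem(word) for word in tweet
             if word not in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)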
Converting the dataset into vectors
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
Splitting into training and test data
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
Classifier model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
This is my code for accessing the tweets:
for status in tweepy.Cursor(api.home_timeline).items(1):
    print "tweet: "+ status.text.encode('utf-8')
    corpus1 = []
    update = status.text
    update = re.sub('[^a-zA-Z]', ' ', update)
    update = update.lower()
    update = update.split()
    ps = PorterStemmer()
    update = [ps.stem(word) for word in update if word not in set(stopwords.words('english'))]
    update = ' '.join(update)
    corpus1.append(update)
When I try to classify the fetched Twitter update with the model:
if classifier.predict(update):
    print "bullying"
else:
    print "not bullying"
I get this error:
ValueError: could not convert string to float: dude
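My guess is that the message comes from predict trying to cast its input to numbers: the stemmed tweet is still a plain string, and a token such as "dude" has no numeric form. The snippet below only illustrates that kind of failed conversion in plain Python; try_float is a hypothetical helper, not something scikit-learn provides.

```python
def try_float(token):
    """Return the token as a float, or the error text if conversion fails."""
    try:
        return float(token)
    except ValueError as exc:
        # A word like "dude" cannot be parsed as a number.
        return str(exc)


message = try_float("dude")
print(message)
```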
How can I feed a single tweet to the model?
My dataset is: https://drive.google.com/open?id=1BG3cFszsZjAJ_pcST2jRxDH0ukf411M-
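For the single-tweet question, one approach that may work is to pass the cleaned tweet through the same fitted CountVectorizer before calling predict, so the model receives a numeric row with the same columns it was trained on. The sketch below is a minimal, self-contained illustration under assumptions: toy_corpus and toy_labels are made-up stand-ins for the real preprocessed data, classify_update is a hypothetical helper, and n_estimators and random_state are set only to keep the toy run reproducible.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the preprocessed corpus and its labels.
toy_corpus = ["dude idiot loser", "nice day friend",
              "hate stupid dude", "love sunni weather"]
toy_labels = [1, 0, 1, 0]

# Fit the vocabulary once, on the training corpus only.
cv = CountVectorizer(max_features=10000)
X = cv.fit_transform(toy_corpus).toarray()

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X, toy_labels)


def classify_update(update):
    """Vectorize one preprocessed tweet with the fitted vocabulary and predict."""
    # transform (not fit_transform) reuses the training vocabulary; wrapping the
    # string in a list makes it a one-document batch of shape (1, n_features).
    features = cv.transform([update]).toarray()
    return classifier.predict(features)[0]


label = classify_update("dude stupid")
print("bullying" if label else "not bullying")
```

In the code above this would correspond to something like classifier.predict(cv.transform([update]).toarray()), assuming cv is still the vectorizer that was fitted on corpus. Words that never appeared in the training data are simply ignored by transform.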