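Predict which users find a review helpful (or "blank" if nobody has found it helpful so far). Either: 1) predict the user string (assuming the names are always in alphabetical order); or 2) for each user, predict whether they will find the review helpful. At the moment the number of users is small (fewer than 10), so either approach is acceptable for this code. It would be interesting, though, to consider a future application that predicts over many more users (say, thousands or millions of possible users).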
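To make option 2 concrete, here is a minimal sketch of how the comma-separated names could be turned into one yes/no column per user. It assumes scikit-learn's `MultiLabelBinarizer` is an acceptable encoding; the helper `split_users` and the variable names are hypothetical and only for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Raw "helpful users" cells as they appear in the last CSV column.
raw_labels = ["Bill", "Jane", "Bill,Jane,Wanda", "", ""]


def split_users(cell):
    """Split a comma-separated cell into a list of names; blank gives []."""
    return [name.strip() for name in cell.split(",") if name.strip()]


label_sets = [split_users(cell) for cell in raw_labels]

# sparse_output=True keeps the indicator matrix sparse, which may matter
# once the number of possible users grows into the thousands or more.
mlb = MultiLabelBinarizer(sparse_output=True)
y = mlb.fit_transform(label_sets)

user_names = list(mlb.classes_)  # column order of the indicator matrix
y_dense = y.toarray()            # dense copy, only for inspection here
```

Each row of `y` then answers "did this user find the review helpful?" for every known user, and a row of all zeros plays the role of "blank".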
Sample data: train.csv
"id","title","review","user tags","user(s) who find review helpful"
"123","All movies!","I really love movies","love,all","Bill"
"456","No movies!","I really hate movies","hate,none","Jane"
"789","Great show!","That show was really great","great,really","Bill,Jane,Wanda"
"899","Interesting plot!","He makes the plot interesting","interesting,plot",""
"999","So tired!","The ending made me sleep","ending,tired,sleepy",""
Test: use the text in columns 1, 2, and 3 to predict the text in column 4. Ignore the numeric id in column 0.
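One way to feed three text columns to a single text vectorizer is to join them into one string per row first. The sketch below assumes the file has the header row shown in the sample and reads every column as a string; `sample_csv`, `combined_text`, and `labels` are illustrative names, not part of the original code.

```python
import io

import pandas as pd

sample_csv = '''"id","title","review","user tags","user(s) who find review helpful"
"123","All movies!","I really love movies","love,all","Bill"
"899","Interesting plot!","He makes the plot interesting","interesting,plot",""
'''

# dtype=str stops pandas from converting any column to numbers;
# keep_default_na=False keeps blank cells as "" instead of NaN.
df = pd.read_csv(io.StringIO(sample_csv), dtype=str, keep_default_na=False)

# One document per row: title, review and tags joined by spaces.
combined_text = df["title"] + " " + df["review"] + " " + df["user tags"]
labels = df["user(s) who find review helpful"]
```

The result is a one-dimensional series of strings, which is the shape a text vectorizer expects as input.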
So far I have been following the guide here (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).
Current code:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Earlier attempts, disabled by wrapping them in a string literal.
'''text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])'''

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf',
     MLPClassifier(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(50, 20),
                   random_state=1, shuffle=True, max_iter=200)
     ),
])

data = pd.read_csv('./train.csv',
                   error_bad_lines=False, header=None, sep=',',
                   dtype={
                       0: np.dtype('u8'),  # id, 64-bit unsigned integer
                       2: np.dtype('U'),   # title, unicode
                       3: np.dtype('U'),   # review, unicode
                       4: np.dtype('U'),   # tags, unicode
                       5: np.dtype('U'),   # name(s), unicode
                   })

# TODO: Split user names column by comma.
xtr = data.iloc[0:100000, 1:5].astype(str).values
ytr = data.iloc[0:100000, 5].values
xtest = data.iloc[100001:101000, 1:5].values
ytest = data.iloc[100001:101000, 5]

text_clf.fit(xtr, ytr)
predicted = text_clf.predict(xtest)
print(np.mean(predicted == ytest))
This produces the following output:
---> 38 predicted = text_clf.predict(data.iloc[100001:101000,5].values)
AttributeError: 'numpy.int64' object has no attribute 'lower'
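The traceback suggests that `CountVectorizer` was handed non-string values: by default it calls `.lower()` on every document, so it needs a one-dimensional iterable of strings. As a rough sketch only, and not necessarily the right model for the real data, the rows could be joined into one string each and the names encoded per user. `OneVsRestClassifier` with `LogisticRegression` is swapped in here as an assumption, replacing the `MLPClassifier` used above, and the toy rows come from the sample data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows from the sample data: (title, review, tags, helpful users).
rows = [
    ("All movies!", "I really love movies", "love,all", "Bill"),
    ("No movies!", "I really hate movies", "hate,none", "Jane"),
    ("Great show!", "That show was really great", "great,really", "Bill,Jane,Wanda"),
    ("Interesting plot!", "He makes the plot interesting", "interesting,plot", ""),
    ("So tired!", "The ending made me sleep", "ending,tired,sleepy", ""),
]

# One string per review, so the vectorizer can call .lower() on it.
texts = [" ".join(str(field) for field in row[:3]) for row in rows]

# One list of names per review; a blank cell becomes an empty list.
label_sets = [[u for u in row[3].split(",") if u] for row in rows]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(label_sets)  # one 0/1 column per user

clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # Fits one independent binary classifier per user column.
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(texts, y)

predicted = clf.predict(["I really love that great show"])
predicted_users = mlb.inverse_transform(predicted)  # list of name tuples
```

With only five training rows the predictions themselves mean little; the point is the shape of the inputs (a list of strings) and of the targets (a 0/1 matrix with one column per user).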