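I am practicing on the Amazon Reviews dataset; the code is below. Basically, after applying BoW I cannot add a column (a pandas array) to the CSR matrix. Even though the number of rows in the two matrices matches, I cannot get it to work.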
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE
#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')
filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape
display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final['Score'].value_counts()
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)
final_counts.shape
type(final_counts)
positive_negative = final['Score']
#Below is giving error
final_counts = hstack((final_counts,positive_negative))
Answer 0 (score: 2)
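sparse.hstack combines the coo-format versions of its inputs into a new coo-format matrix. final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.
positive_negative is a column of a DataFrame. Look at
sparse.coo_matrix(positive_negative)
It is probably a (1,n) sparse matrix. But to be combined with final_counts it needs to have shape (n,1), with one row per row of final_counts.
Try creating the sparse matrix and transposing it: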
sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
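The shape argument above can be checked on a toy example. This is only a minimal sketch: counts and labels are hypothetical stand-ins for final_counts and positive_negative, and the labels are assumed to be numeric already.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: 4 "reviews" x 3 "words", plus one numeric label each.
counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                     [0, 3, 0],
                                     [4, 0, 0],
                                     [0, 0, 5]]))
labels = np.array([1, 0, 1, 1])

row = sparse.coo_matrix(labels)  # a 1-D input becomes a single row
col = row.T                      # transpose so there is one entry per review

# hstack needs the same number of rows in every block.
merged = sparse.hstack((counts, col), format="csr")
print(row.shape, col.shape, merged.shape)
```

Without the transpose, hstack would be asked to join a 4-row block with a 1-row block and would raise an error about incompatible row dimensions.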
Answer 1 (score: 0)
I used the line below but am still getting an error:
merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
The error is below:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in bmat
    dtype = upcast(*all_dtypes) if all_dtypes else None
  File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
Answer 2 (score: 0)
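I ran into the same problem with sparse matrices. You can use todense() to convert the CSR matrix to a dense one and then use np.hstack((dataframe.values, converted_dense_matrix)). That will work. You cannot use numpy.hstack directly on sparse matrices.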
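A rough sketch of the dense route on toy data follows. The names counts and labels are hypothetical, the labels are assumed to be numeric, and toarray() is used here in place of todense() so that the result is a plain ndarray rather than a numpy matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the CSR count matrix and the label column.
counts = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 2],
                                     [3, 0]]))
labels = np.array([1, 0, 1])

# Densify the sparse block; this allocates rows x columns values in memory.
dense_counts = counts.toarray()

# np.hstack needs 2-D blocks with equal row counts, so reshape labels to (n, 1).
merged_dense = np.hstack((labels.reshape(-1, 1), dense_counts))
print(merged_dense.shape)
```

This is only practical when the vocabulary is small enough for the dense array to fit in memory.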
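However, converting to a dense matrix is not a good idea for very large datasets. In your case scipy's hstack fails because the data types differ: it is being asked to stack int and object data.
Try positive_negative = final['Score'].values and stack that with scipy.sparse.hstack. If it does not work, can you give me the output of positive_negative.dtype?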
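One possible way around the dtype mismatch, sketched here on toy data, is to turn the string labels into integers before building the sparse column. The 0/1 encoding and the names counts, score, and label_col are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins: an int64 count matrix and string labels (object dtype).
counts = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 2],
                                     [3, 0]], dtype=np.int64))
score = pd.Series(["positive", "negative", "positive"], dtype=object)
print(score.dtype)  # strings stored as Python objects

# Assumed encoding: negative -> 0, positive -> 1.
score_int = score.map({"negative": 0, "positive": 1}).to_numpy(dtype=np.int64)

# Build an (n, 1) sparse column and stack it without densifying anything.
label_col = sparse.csr_matrix(score_int.reshape(-1, 1))
merged = sparse.hstack((counts, label_col), format="csr")
print(merged.shape, merged.dtype)
```

Because both blocks are now numeric, scipy can find a common dtype and the stack stays sparse.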