Question

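I am practicing on the Amazon Reviews dataset; the code is below. Basically, after applying BoW I cannot add a column (a pandas array) to the CSR matrix. Even though the number of rows in the two matrices matches, the stacking fails.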

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE

#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')

filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape

display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)

sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)

final.shape

display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)

final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

final['Score'].value_counts()

count_vect = CountVectorizer()

final_counts = count_vect.fit_transform(final['Text'].values)

final_counts.shape

type(final_counts)

positive_negative = final['Score']

# The line below raises an error
final_counts = hstack((final_counts,positive_negative))

Answer 1

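sparse.hstack converts its inputs to coo format and combines them into a new coo format matrix.

final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.

positive_negative is a column of a DataFrame. Look at

 sparse.coo_matrix(positive_negative)

It is probably a (1, n) sparse matrix. But to be combined with final_counts it needs to have shape (n, 1).

Try creating the sparse matrix and transposing it (sparse here is scipy.sparse, imported with from scipy import sparse):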

sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
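The shape argument above can be checked on a small example. This is only a sketch: the toy counts matrix and the numeric labels array are made up for illustration, and the labels are assumed to be numeric already (string labels are a separate problem).

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for final_counts: 4 documents, 3 vocabulary terms.
counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                     [0, 3, 0],
                                     [4, 0, 0],
                                     [0, 0, 5]]))

# Hypothetical numeric labels, one per document.
labels = np.array([1, 0, 1, 1])

# A 1-D input becomes a single-row sparse matrix ...
row = sparse.coo_matrix(labels)
print(row.shape)

# ... so it has to be transposed into a column before stacking.
col = row.T
print(col.shape)

merged = sparse.hstack((counts, col))
print(merged.shape)
```

The transposed column has one entry per row of counts, which is what hstack requires.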

Answer 2

I used the line below but I am still getting an error:

merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))

The error is below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in bmat
    dtype = upcast(*all_dtypes) if all_dtypes else None
  File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
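The two errors appear to have separate causes. The NameError comes from using the bare name sparse without importing it. The TypeError comes from the label column holding the strings 'positive' and 'negative', which cannot be upcast together with the integer counts. A minimal sketch of one possible fix follows, assuming it is acceptable to encode the labels as 1 and 0. The toy data and the names scores, labels and label_column are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse  # binds the bare name `sparse`; `import scipy` alone does not

# Hypothetical stand-ins for final_counts and the raw Score column.
final_counts = sparse.csr_matrix(
    np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 1, 1]], dtype=np.int64)
)
scores = pd.Series([5, 1, 4, 2])

# Same mapping as partition() in the question: this yields string labels.
positive_negative = scores.map(lambda x: 'negative' if x < 3 else 'positive')

# Encode the string labels as integers (assumption: positive -> 1, negative -> 0).
labels = (positive_negative == 'positive').astype(np.int64)

# Build an (n, 1) sparse column and stack it next to the counts.
label_column = sparse.csr_matrix(labels.to_numpy().reshape(-1, 1))
merged_data = sparse.hstack((final_counts, label_column), format='csr')
print(merged_data.shape)
```

Printing positive_negative.dtype before the encoding step is a quick way to confirm whether the column holds strings.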

Answer 3

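I ran into the same problem with sparse matrices. You can convert the CSR matrix to a dense one with todense() and then use np.hstack((dataframe.values, converted_dense_matrix)). That will work. You cannot use numpy.hstack on sparse matrices directly.
However, for very large datasets, converting to a dense matrix is not a good idea. In your case scipy's hstack fails because the data types differ: hstack(int, object). Try positive_negative = final['Score'].values and stack it with scipy.sparse.hstack. If that does not work, can you give me the output of positive_negative.dtype?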
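The dense and sparse routes can be compared side by side on toy data. This is a sketch under assumptions: the labels are taken to be numeric already, the arrays are invented for illustration, and toarray() is used here instead of todense() so that the dense result is a plain ndarray (that substitution is a choice made for this example).

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: 3 documents, 2 vocabulary terms, numeric labels.
counts = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 2],
                                     [3, 0]]))
labels = np.array([1, 0, 1])

# Dense route: simple, but memory use grows with rows * columns.
dense_counts = counts.toarray()
merged_dense = np.hstack((dense_counts, labels.reshape(-1, 1)))

# Sparse route: keeps the result in CSR format, which scales better.
label_column = sparse.csr_matrix(labels.reshape(-1, 1))
merged_sparse = sparse.hstack((counts, label_column), format='csr')

print(merged_dense.shape, merged_sparse.shape)
```

Both routes produce the same values; only the storage differs, which is why the sparse route is usually preferable for a large bag-of-words matrix.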

hstack csr matrix with pandas array
