hstack a CSR matrix with a pandas array

Time: 2018-08-06 05:21:08

Tags: pandas numpy scipy sparse-matrix

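I am working through an exercise on Amazon Reviews; the code is below. Basically, after applying BoW (bag of words), I cannot add a column (a pandas array) to the resulting CSR matrix. Even though the number of rows in the two matrices matches, I cannot get the stacking to work.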

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE

#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')

filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape

display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)

sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)

final.shape

display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)

final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

final['Score'].value_counts()

count_vect = CountVectorizer()

final_counts = count_vect.fit_transform(final['Text'].values)

final_counts.shape

type(final_counts)

positive_negative = final['Score']

# The line below raises an error
final_counts = hstack((final_counts,positive_negative))

3 Answers:

Answer 0 (score: 2)

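sparse.hstack combines the coo-format matrices of its inputs into a new coo-format matrix.

final_counts is a csr matrix, so the conversion sparse.coo_matrix(final_counts) is trivial.

positive_negative is a column of a DataFrame. Look at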

 sparse.coo_matrix(positive_negative)

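It is probably a (1, n) sparse matrix. But to combine it with final_counts, it needs to have shape (n, 1), where n is the number of rows of final_counts.

Try creating the sparse matrix and transposing it: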

sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
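The shape reasoning above can be checked on toy data. The following is a minimal sketch, not the asker's actual data: the small `final_counts` matrix and the numeric `extra_column` Series are made-up stand-ins, and a numeric column is assumed so that the dtype issue does not interfere.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for the bag-of-words matrix: 3 documents, 4 vocabulary terms.
final_counts = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                           [0, 1, 0, 0],
                                           [3, 0, 0, 1]]))

# Hypothetical numeric column with one value per document.
extra_column = pd.Series([10, 20, 30])

# A 1-D input becomes a single-row sparse matrix ...
row_vector = sparse.coo_matrix(extra_column.to_numpy())
# ... so it is transposed to get one entry per document row.
column_vector = row_vector.T

# hstack returns coo format by default; convert back to csr for later use.
stacked = sparse.hstack((final_counts, column_vector)).tocsr()

print(row_vector.shape, column_vector.shape, stacked.shape)
```

Printing the shapes before stacking is a cheap way to confirm that the row counts agree and only the column counts differ.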

Answer 1 (score: 0)

I used the line below but am still getting an error:

merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))

The error is below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in bmat
    dtype = upcast(*all_dtypes) if all_dtypes else None
  File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
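The dtype('O') in the last line is consistent with the question's code, where the Score column is replaced by the strings 'positive' and 'negative'. One possible way around it, sketched here on toy data, is to encode the labels as integers before stacking. The 0/1 encoding, the `label_codes` and `label_column` names, and the small matrix are assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for the bag-of-words counts (3 documents, 3 terms).
final_counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                           [0, 1, 0],
                                           [3, 0, 0]]))

# String labels stored as Python objects, mimicking the mapped Score column.
positive_negative = pd.Series(['positive', 'negative', 'positive'], dtype=object)

# Assumed encoding: negative -> 0, positive -> 1.
label_codes = positive_negative.map({'negative': 0, 'positive': 1}).astype(np.int64)

# Reshape to a single column (n, 1) and wrap it as a sparse matrix.
label_column = sparse.csr_matrix(label_codes.to_numpy().reshape(-1, 1))

# Both blocks are now numeric, so a common dtype can be found.
merged_data = sparse.hstack((final_counts, label_column)).tocsr()

print(merged_data.shape, merged_data.dtype)
```

With both blocks numeric, the upcast step shown in the traceback has a common type to fall back on.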

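Answer 2 (score: 0)

I ran into the same problem with sparse matrices. You can convert the CSR matrix to a dense one with todense() and then use np.hstack((dataframe.values, converted_dense_matrix)). That will work. You cannot use numpy.hstack on sparse matrices.
However, for very large datasets, converting to a dense matrix is not a good idea. In your case, scipy's hstack fails because the data types differ: hstack(int, object). Try positive_negative = final['Score'].values and stack that with scipy.sparse.hstack. If that does not work, could you post the output of positive_negative.dtype?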
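For small data, the dense route described above might look like the sketch below. It uses toarray() rather than todense(), which is a substitution on my part: toarray() returns a plain ndarray instead of an np.matrix. The `extra` DataFrame and the tiny matrix are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy sparse counts: 3 documents, 2 terms.
final_counts = sparse.csr_matrix(np.array([[1, 0],
                                           [0, 2],
                                           [3, 0]]))

# Hypothetical numeric column held in a DataFrame.
extra = pd.DataFrame({'label': [1, 0, 1]})

# Check the dtype first; an object dtype would need encoding before stacking.
print(extra['label'].dtype)

# Densify the sparse block, then stack with NumPy's hstack.
dense_counts = final_counts.toarray()
combined = np.hstack((extra.values, dense_counts))

# The dense copy stores every cell, including zeros.
print(combined.shape, dense_counts.nbytes)
```

The dense array needs rows × columns × item size bytes, which is why this approach only suits small vocabularies or samples.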