How to fix the "NaN or infinity" issue for a sparse matrix in Python?

Time: 2013-09-22 19:49:54

Tags: python scikit-learn nan

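I'm totally new to Python. I've been using some code I found online and I'm trying to build on it. I'm creating a text-document matrix, and I want to add some extra features before training a logistic regression model.

Although I have checked my data with R and found no errors, when I run the logistic regression I get the error "ValueError: Array contains NaN or infinity." I don't get this error when I don't add my own features. My features are in the file "toPython.txt".

Note the two calls to the assert_all_finite function, both of which return "None"!

Below is the code I use and the output I get: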

def _assert_all_finite(X):
    if X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all():
        raise ValueError("Array contains NaN or infinity.")

def assert_all_finite(X):
    _assert_all_finite(X.data if sparse.issparse(X) else X)

def main():

    print "loading data.."
    traindata = list(np.array(p.read_table('C:/Users/Stergios/Documents/Python/data/train.tsv'))[:,2])
    testdata = list(np.array(p.read_table('C:/Users/Stergios/Documents/Python/data/test.tsv'))[:,2])
    y = np.array(p.read_table('C:/Users/Stergios/Documents/Python/data/train.tsv'))[:,-1]

    tfv = TfidfVectorizer(min_df=12,  max_features=None, strip_accents='unicode',
        analyzer='word',stop_words='english', lowercase=True,
        token_pattern=r'\w{1,}',ngram_range=(1, 1), use_idf=1,smooth_idf=1,sublinear_tf=1)

    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                             C=1, fit_intercept=True, intercept_scaling=1.0,
                             class_weight=None, random_state=None)

    X_all = traindata + testdata
    lentrain = len(traindata)

    f = np.array(p.read_table('C:/Users/Stergios/Documents/Python/data/toPython.txt'))
    indices = np.nonzero(~np.isnan(f))
    b = csr_matrix((f[indices], indices), shape=f.shape, dtype='float')

    print b.get_shape
    **print assert_all_finite(b)**
    print "fitting pipeline"
    tfv.fit(X_all)
    print "transforming data"
    X_all = tfv.transform(X_all)
    print X_all.get_shape

    X_all=hstack( [X_all,b], format='csr' )
    print X_all.get_shape

    **print assert_all_finite(X_all)**

    X = X_all[:lentrain]
    print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))

The output is:

loading data..
<bound method csr_matrix.get_shape of <10566x40 sparse matrix of type '<type 'numpy.float64'>'
with 422640 stored elements in Compressed Sparse Row format>>
**None**
fitting pipeline
transforming data
<bound method csr_matrix.get_shape of <10566x13913 sparse matrix of type '<type 'numpy.float64'>'
with 1450834 stored elements in Compressed Sparse Row format>>
<bound method csr_matrix.get_shape of <10566x13953 sparse matrix of type '<type 'numpy.float64'>'
with 1873474 stored elements in Compressed Sparse Row format>>
**None**
3 Fold CV Score: 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 523, in runfile
    execfile(filename, namespace)
  File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 100, in <module>
    main()
  File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 97, in main
    print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1064, in _cross_val_score
    score = scorer(estimator, X_test, y_test)
  File "C:\Python27\lib\site-packages\sklearn\metrics\scorer.py", line 141, in __call__
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 403, in roc_auc_score
    fpr, tpr, tresholds = roc_curve(y_true, y_score)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 672, in roc_curve
    fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 504, in _binary_clf_curve
    y_true, y_score = check_arrays(y_true, y_score)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas? Thank you!!

3 Answers:

Answer 0 (score: 3)

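I've found that the following works quite nicely, assuming sm is a sparse matrix (mine was a CSR matrix; if you know how this behaves for other types, please say so!):

Manually replace the nans with appropriate numbers in the data vector:

In [4]: np.isnan(sm.data).any()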
Out[4]: True

In [5]: sm.data.shape
Out[5]: (553555,)

In [6]: sm.data = np.nan_to_num(sm.data)

In [7]: np.isnan(sm.data).any()
Out[7]: False

In [8]: sm.data.shape
Out[8]: (553555,)

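So we no longer have nan values, but the matrix now explicitly stores those zeros as entries.

Remove the explicitly stored zeros from the sparse matrix: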

In [9]: sm.eliminate_zeros()

In [10]: sm.data.shape
Out[10]: (551391,)

Our matrix actually got smaller now, yay!
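The two steps above can be wrapped into one helper. This is only a sketch: `clean_sparse` and its `fill` parameter are names made up for illustration (they are not part of SciPy), and it assumes that replacing nan with a constant is acceptable for your data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def clean_sparse(sm, fill=0.0):
    """Return a CSR copy of `sm` with stored NaN values replaced by `fill`.

    Hypothetical helper for illustration; not a SciPy function.
    """
    out = csr_matrix(sm, dtype=float, copy=True)
    # Only explicitly stored values live in .data, so this never touches
    # the implicit zeros of the matrix.
    out.data[np.isnan(out.data)] = fill
    # Entries that are now stored as explicit zeros can be dropped.
    out.eliminate_zeros()
    return out


# A 2x3 matrix with three stored entries, one of which is NaN.
sm = csr_matrix(([1.0, np.nan, 2.0], ([0, 0, 1], [0, 2, 1])), shape=(2, 3))
cleaned = clean_sparse(sm)
print(sm.nnz, cleaned.nnz)
print(np.isnan(cleaned.data).any())
```

Working on a copy keeps the original matrix around, which is handy if you want to compare before and after; modify sm.data in place instead if memory is tight.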

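Answer 1 (score: 1)

This usually happens when you have missing values in your data, or as a result of your processing.

First, find the cells in the sparse matrix X that hold Nan or Inf values: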

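import numpy as np
from scipy.sparse import coo_matrix

def find_nan_in_csr(X):
    # COO format exposes the row, column and value of every stored entry
    X = coo_matrix(X)
    for i, j, v in zip(X.row, X.col, X.data):
        if (np.isnan(v) or np.isinf(v)):
            print(i, j, v)
    return None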

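This function prints the row and column indices of the entries in the sparse matrix whose values are problematic. How to then "fix" the values depends on what caused them (missing values, etc.).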
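For large matrices, the same lookup can be done without a Python loop. A rough sketch, where `nonfinite_entries` is a name I made up for illustration and the input is assumed to hold floating-point data:

```python
import numpy as np
from scipy.sparse import coo_matrix


def nonfinite_entries(X):
    """Return (rows, cols, values) of stored entries that are NaN or +/-Inf.

    Hypothetical helper for illustration; not a SciPy function.
    """
    X = coo_matrix(X)
    # np.isfinite is False for NaN, +Inf and -Inf alike.
    bad = ~np.isfinite(X.data)
    return X.row[bad], X.col[bad], X.data[bad]


# 3x3 example with one normal value, one NaN and one Inf on the diagonal.
X = coo_matrix(([1.0, np.nan, np.inf], ([0, 1, 2], [0, 1, 2])), shape=(3, 3))
rows, cols, vals = nonfinite_entries(X)
for i, j, v in zip(rows, cols, vals):
    print(i, j, v)
```

Returning the indices instead of printing them makes it easier to look up the offending rows in the original input file.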

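EDIT: Note that sklearn often uses dtype=np.float32 for maximum efficiency, so it may convert the sparse matrix to np.float32 (via X = X.astype(dtype = np.float32)). When converting from float64 to np.float32, a very large number (e.g., 2.9e+200) is converted to inf.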
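The overflow is easy to reproduce with plain NumPy. A small sketch (the array values are made up for the demo); comparing against np.finfo(np.float32).max before casting is one possible way to spot values that will not survive the conversion:

```python
import numpy as np

x = np.array([1.5, 2.9e+200])        # float64 by default

# Newer NumPy versions emit an overflow warning on this cast;
# it is silenced here only to keep the demo output clean.
with np.errstate(over="ignore"):
    x32 = x.astype(np.float32)

print(x32)
print(np.isfinite(x32))

# Flag values outside the float32 range before casting.
limit = np.finfo(np.float32).max
too_large = np.abs(x) > limit
print(too_large)
```

If too_large has any True entries, the data probably needs rescaling or clipping before it goes into an estimator that works in float32.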

Answer 2 (score: 1)

I usually use this function:

x = np.nan_to_num(x)

It replaces nan with zero and inf with finite numbers.
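To make the behaviour concrete, here is a small example with default arguments (the input values are made up). Note that inf is not turned into zero but into the largest finite value of the dtype, which may or may not be what you want in a feature matrix:

```python
import numpy as np

x = np.array([np.nan, np.inf, -np.inf, 1.0])
y = np.nan_to_num(x)
print(y)

# With default arguments, +/-inf map to the extreme finite float64 values.
fmax = np.finfo(np.float64).max
print(y[1] == fmax, y[2] == -fmax)
```

For a sparse matrix, the same call can be applied to its .data array, as in the first answer.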