I have a sparse.txt that looks like this:
# first column is label 0 or 1
# rest of the data is sparse data
# maximum value in the data is 4, so the future dense matrix will
# have 1+4 = 5 elements in a row
# file: sparse.txt
1 1:1 2:1 3:1
0 1:1 4:1
1 2:1 3:1 4:1
The required dense.txt looks like this:
# required file: dense.txt
1 1 1 1 0
0 1 0 0 1
1 0 1 1 1
Without using scipy coo_matrix, a simple way to do this is:
def create_dense(fsparse, fdense,fvocab):
    # number of lines in vocab
    lvocab = sum(1 for line in open(fvocab))
    # create dense file
    with open(fsparse) as fi, open(fdense,'w') as fo:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            indices = [int(w) for (i,w) in enumerate(words) if int(i)%2]
            row = [0]* (lvocab+1)
            row[0] = label
            # use listcomps
            row = [ 1 if i in indices else row[i] for i in range(len(row))]
            l = " ".join(map(str,row)) + "\n"
            fo.write(l)
            print('Writing dense matrix line: ', i+1)
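Question: How can I get the labels and data directly from the sparse data, without first creating the dense matrix, preferably using NumPy/SciPy?
Question: How can we read the sparse data using numpy.fromregex?
My attempt is: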
def read_file(fsparse):
    regex = r'([0-1]\s)([0-9]):(1\s)*([0-9]:1)' + r'\s*\n'
    data = np.fromregex(fsparse,regex,dtype=str)
    print(data,file=open('dense.txt','w'))
It does not work!
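Answer 0 (score: 2)
(Answered before sklearn was explicitly ruled out.)
This is basically the svmlight / libsvm format.
Just use scikit-learn's load_svmlight_file or the more efficient svmlight-loader. There is no need to reinvent the wheel here!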
from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file('C:/TEMP/sparse.txt')
print(X)
print(y)
print(X.todense())
Output:
(0, 0) 1.0
(0, 1) 1.0
(0, 2) 1.0
(1, 0) 1.0
(1, 3) 1.0
(2, 1) 1.0
(2, 2) 1.0
(2, 3) 1.0
[ 1. 0. 1.]
[[ 1. 1. 1. 0.]
[ 1. 0. 0. 1.]
[ 0. 1. 1. 1.]]
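Note that X and y come back separately, while the requested dense.txt has the label in column 0. A minimal sketch of one way to glue them back together without leaving sparse format, assuming scipy is available. X and y are rebuilt by hand here from the output printed above, purely so the snippet runs without sklearn:

```python
import numpy as np
from scipy import sparse

# X and y as printed above, rebuilt by hand (assumption: in real code
# they would come from load_svmlight_file).
rows = [0, 0, 0, 1, 1, 2, 2, 2]
cols = [0, 1, 2, 0, 3, 1, 2, 3]
X = sparse.csr_matrix((np.ones(8), (rows, cols)), shape=(3, 4))
y = np.array([1.0, 0.0, 1.0])

# Turn the labels into a one-column sparse matrix and prepend it.
labels = sparse.csr_matrix(y.reshape(-1, 1))
full = sparse.hstack([labels, X], format="csr")

# Densify only at the very end, e.g. for inspection or writing out.
dense = full.toarray().astype(int)
print(dense)
```

The rows of dense should then match the lines of the required dense.txt.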
Answer 1 (score: 2)
Adapting your code to create a dense array directly, rather than going through a file:
fsparse = 'stack47266965.txt'
def create_dense(fsparse, lvocab):
    alist = []
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            indices = [int(w) for (i,w) in enumerate(words) if int(i)%2]
            row = [0]* (lvocab+1)
            row[0] = label
            # use listcomps
            row = [ 1 if i in indices else row[i] for i in range(len(row))]
            alist.append(row)
    return alist
alist = create_dense(fsparse, 4)
print(alist)
import numpy as np
arr = np.array(alist)
from scipy import sparse
M = sparse.coo_matrix(arr)
print(M)
print(M.toarray())
produces
0926:~/mypy$ python3 stack47266965.py
[[1, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 1, 1]]
(0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(1, 1) 1
(1, 4) 1
(2, 0) 1
(2, 2) 1
(2, 3) 1
(2, 4) 1
[[1 1 1 1 0]
[0 1 0 0 1]
[1 0 1 1 1]]
If you want to skip the dense arr, you need to generate the equivalent of the M.row, M.col, and M.data attributes (order doesn't matter):
[0 0 0 0 1 1 2 2 2 2]
[0 1 2 3 1 4 0 2 3 4]
[1 1 1 1 1 1 1 1 1 1]
I haven't worked with regex, so I won't try to fix that. I assume you want to convert
'1 1:1 2:1 3:1'
into
['1' '1' '1' '2' '1' '3' '1']
But that only gets you to the words/label stage.
Directly to sparse:
def create_sparse(fsparse, lvocab):
    row, col, data = [],[],[]
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            row.append(i); col.append(0); data.append(label)
            indices = [int(w) for (i,w) in enumerate(words) if int(i)%2]
            for j in indices: # quick-n-dirty version
                row.append(i); col.append(j); data.append(1)
    return row, col, data
r,c,d = create_sparse(fsparse, 4)
print(r,c,d)
M = sparse.coo_matrix((d,(r,c)))
print(M)
print(M.toarray())
produces
[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2] [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4] [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
....
The only difference is one data item with the value 0. sparse will take care of that.
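To make the point about the stored 0 concrete, here is a minimal sketch using the triplets printed above. It also passes shape explicitly, on the assumption that lvocab is the vocabulary size. Otherwise coo_matrix infers the shape from the largest indices and would drop any trailing columns that never appear in the data:

```python
from scipy import sparse

# Triplets copied from the output above; the fifth entry is the stored label 0.
r = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
c = [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4]
d = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

lvocab = 4  # assumed vocabulary size, as in the call above
M = sparse.coo_matrix((d, (r, c)), shape=(3, lvocab + 1))
stored_before = M.nnz   # counts stored entries, including the explicit zero

C = M.tocsr()
C.eliminate_zeros()     # in-place removal of explicitly stored zeros
stored_after = C.nnz

dense = C.toarray()
print(stored_before, stored_after)
print(dense)
```

The explicit zero only costs one stored entry. It does not change the dense result, so calling eliminate_zeros() is optional.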