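I'm working on software that handles sparse matrices. They aren't large (anywhere from 15x15 to 300x300). I'd like to store a representation of each matrix as a short string so that I can save it as a value in a CSV file (alongside many other things).
So far I have tried treating the matrix as a binary string and converting it to base 62: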
import numpy as np
import networkx as nx

def graphToHash(a, numnodes):
    def baseN(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
        return ((num == 0) and numerals[0]) or (baseN(num // b, b, numerals).lstrip(numerals[0]) + numerals[num % b])
    # read the flattened 0/1 matrix as one binary number, then write it in base 62
    return str(numnodes) + '!' + baseN(int(''.join([str(i) for i in flatten_list(a)]), 2), 62)

def flatten_list(l):
    # flatten one level of nesting (lists or NumPy arrays)
    l1 = [item for sublist in l if isinstance(sublist, list) or isinstance(sublist, np.ndarray) for item in sublist]
    l = l1 + [item for item in l if not isinstance(item, list) and not isinstance(item, np.ndarray)]
    return l

# example
import sys
sys.setrecursionlimit(10000)
# to_numpy_matrix was removed in NetworkX 3.0; nx.to_numpy_array replaces it there
a = np.array(nx.to_numpy_matrix(nx.connected_watts_strogatz_graph(160, 8, .3, 1000))).astype(int)
hash = graphToHash(a, 160)
len(hash)  # ~4300 characters
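This works well for small graphs (about 150 characters for 30 nodes). Larger graphs, however, get unwieldy (160 nodes takes ~4300 characters and requires me to raise the recursion limit).
Because the graphs are binary and sparse, I know I can do better. Ideally I'd like to keep using strings over {0-9, a-z, A-Z}, since I know those characters won't cause any problems in my CSV file.
What is the most efficient way to compress a binary sparse matrix?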
Answer 0 (score: 2)
After our long discussion in the comments, I remembered that this is a binary array... derp. Run-length encoding:
def brle(decoded):  # binary run length encoding
    run = 0
    encoded = []
    for i in decoded:
        if i:
            encoded.append(run)  # number of zeros since the previous 1
            run = 0
        else:
            run += 1
    return encoded  # zeros after the last 1 are not recorded

def brld(encoded):  # binary run length decoding
    # random trickery to get original length of flat list;
    # it assumes exactly one zero follows the last 1
    decoded = np.zeros(sum(encoded) + len(encoded) + 1)
    pos = 0
    for run in encoded:
        pos += run
        decoded[pos] = 1
        pos += 1
    return decoded
Without any alphanumeric encoding:
# to_numpy_matrix was removed in NetworkX 3.0; nx.to_numpy_array replaces it there
a = np.array(nx.to_numpy_matrix(nx.connected_watts_strogatz_graph(160, 4, .3, 1000))).astype(int)
b = flatten_list(a)
encoded = brle(b)
len(';'.join([str(x) for x in encoded]))  # == 1706 chars
c = brld(encoded)
assert all(b == c)  # passes
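The decoder above infers the original length as sum(encoded) + len(encoded) + 1, which is only correct when the flat list ends with exactly one zero after its last 1. A minimal sketch of a variant that also records the final run of zeros, so any trailing pattern round-trips; the names `brle_full` and `brld_full` are hypothetical and not part of the answer:

```python
def brle_full(bits):
    """Run-length encode a 0/1 sequence, keeping the trailing run of zeros."""
    runs = []
    run = 0
    for bit in bits:
        if bit:
            runs.append(run)  # zeros seen since the previous 1
            run = 0
        else:
            run += 1
    runs.append(run)  # zeros after the last 1 (may be 0)
    return runs


def brld_full(runs):
    """Invert brle_full: every run except the last is followed by a 1."""
    bits = []
    for run in runs[:-1]:
        bits.extend([0] * run)
        bits.append(1)
    bits.extend([0] * runs[-1])
    return bits


sample = [0, 1, 1, 0, 0, 1, 0, 0, 0]
runs = brle_full(sample)
print(runs)  # [1, 0, 2, 3]
assert brld_full(runs) == sample
```

The cost is one extra number per matrix, and the decoder no longer has to guess the length.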
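With UTF-8 encoding:
# Python 3 (on Python 2, use unichr instead of chr)
# runs in the surrogate range 0xD800-0xDFFF cannot be encoded this way
s = ''.join(chr(x) for x in encoded).encode('utf-8')  # 711 bytes in memory
assert encoded == [ord(x) for x in s.decode('utf-8')]  # passes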
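If the string has to stay within {0-9, a-z, A-Z}, as the question asks, the run lengths could be written as fixed-width base-62 fields instead of UTF-8 code points. A minimal sketch under that assumption; `pack_runs`, `unpack_runs` and the default width of 3 characters (enough for runs up to 62**3 - 1 = 238327, more than the 90000 cells of a 300x300 matrix) are illustrative choices, not part of the answer:

```python
import string

# Same digit order as the baseN() alphabet in the question.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def pack_runs(runs, width=3):
    """Write each run length as a fixed-width base-62 field."""
    limit = len(ALPHABET) ** width
    fields = []
    for run in runs:
        if not 0 <= run < limit:
            raise ValueError(f"run {run} does not fit in {width} characters")
        digits = []
        for _ in range(width):
            run, rem = divmod(run, 62)
            digits.append(ALPHABET[rem])
        fields.append(''.join(reversed(digits)))
    return ''.join(fields)


def unpack_runs(text, width=3):
    """Invert pack_runs by reading the string back in width-sized fields."""
    runs = []
    for start in range(0, len(text), width):
        value = 0
        for ch in text[start:start + width]:
            value = value * 62 + ALPHABET.index(ch)
        runs.append(value)
    return runs


runs = [0, 5, 61, 62, 3843]
packed = pack_runs(runs, width=2)
print(packed)  # 00050Z10ZZ
assert unpack_runs(packed, width=2) == runs
```

A fixed width spends the same number of characters on short and long runs, so it trades some compactness for a string that needs no separator.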
Answer 1 (score: 2)
How about using the sparse6 format? It uses printable ASCII characters.
http://users.cecs.anu.edu.au/~bdm/data/formats.txt
import networkx as nx
G = nx.connected_watts_strogatz_graph(160,4,.3,1000)
s = nx.generate_sparse6(G)  # older NetworkX API; newer releases provide nx.to_sparse6_bytes(G)
print(len(s))
print(s)
505
>>sparse6<<:~?A__O??K@?SA?[B__D_kE?{F@CH`KI@[J`_L`gM`{NACOaGQ`?QA[R_oIAcTAsUA{VBCWBKXBSHBOZbW[ac\BsSBo^cO_CKacOb_?bCcccgedGfd?hdSiD[jDcldsmdwo_GTE?peGqe[rEcsEktb_^E{vcGwfGyfg{d_{FkAFg}_ObFs~gKgFH@GS\DhAG\BglDGtEG{AG|GhSmHPJhXKhkuHlNiO?CPOIKMILQI\RIdSIlTItUI|VJDMJ@XjHYbwjJPZcxZ_WPc`\Js`FX^_`__`CKL`k\bKc\IXcKldKsbDpfk|WLLhdPeLPjl[nHHllhmf`fLpnl{lMLCMLqM[?Eprm`tmtuM|vNDwNLxNTyN\rN[mKH{i@|n{~LP~OCeDI?oUA_YBOeBOcMOiEosjOyGpAHghZPUIh@sNYKeIKhqMpiNcwzPyO`QOQMS??~QQR`iRql{QsXQqVpqVjhmREXRSyR\NNqZRc?@a[Rk@Jq]
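Answer 2 (score: 0)
Since the graph is expected to be sparse, I would encode its adjacency-list-based representation. Something like this (note that I reused your version of baseN(), but I would replace it with an iterative one):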
#!/usr/bin/env python3

def baseN(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    return ((num == 0) and numerals[0]) or (baseN(num // b, b, numerals).lstrip(numerals[0]) + numerals[num % b])

def encode_graph(g):
    # the leading 'a' is needed to protect the leading zero (if any)
    s = 'a' + 'a'.join(['a'.join(map(str, x)) for x in g])
    n = int(s, 11)
    return baseN(n, 62)

print(encode_graph([(0, 1), (1, 5), (1, 23), (5, 23)]))  # outputs 64wc3BssnTd
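The recursive baseN() is what forces the higher recursion limit on long inputs. A minimal sketch of the iterative replacement mentioned above, together with a decoder for the edge-list encoding; `to_base`, `from_base` and `decode_graph` are hypothetical names, and the decoder assumes every edge is a pair of non-negative node indices:

```python
NUMERALS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(num, b, numerals=NUMERALS):
    """Iterative counterpart of baseN(): non-negative int to base-b string."""
    if num == 0:
        return numerals[0]
    digits = []
    while num:
        num, rem = divmod(num, b)
        digits.append(numerals[rem])
    return ''.join(reversed(digits))


def from_base(text, b, numerals=NUMERALS):
    """Parse a base-b string produced by to_base()."""
    value = 0
    for ch in text:
        value = value * b + numerals.index(ch)
    return value


def encode_graph(g):
    # Same scheme as the answer: 'a' separates the decimal node indices and
    # the whole string is read as a base-11 number.
    s = 'a' + 'a'.join('a'.join(map(str, edge)) for edge in g)
    return to_base(int(s, 11), 62)


def decode_graph(text):
    # Base 11 reuses the first 11 numerals, so digit 10 comes back as 'a'.
    s = to_base(from_base(text, 62), 11)
    nodes = [int(tok) for tok in s.split('a')[1:] if tok]
    return list(zip(nodes[0::2], nodes[1::2]))


edges = [(0, 1), (1, 5), (1, 23), (5, 23)]
assert decode_graph(encode_graph(edges)) == edges
```

Because the loop never recurses, there is no need to call sys.setrecursionlimit() however long the encoded number gets.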