Python: read a file into a list of numpy arrays?

Time: 2018-01-17 18:45:05

Tags: python numpy

How can I read a file containing lines of text, which may differ in length, as ASCII codes? For example, reading

APORRADASD
ASDSDASD

as

0 -> [065,080,079,082,082,065,068,065,083,068]
1 -> [065,083,068,083,068,065,083,068]

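and then subtracting 65 from each value (only uppercase letters are present). When I read each line in a for loop and then subtract, the process is very slow, because the file is about 300+ MB: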

data = np.genfromtxt('data.txt', dtype=np.object_)
for x in range(0,data.shape[0]):
    data[x] = [c - 65 for c in data[x]]

2 answers:

Answer 0: (score: 0)

Try this:

ascii_conv = []
foo = "APORRADASD"
bar = list(foo)
for n in bar:
    ascii_conv.append(ord(n))

Answer 1: (score: 0)

I replicated your sample 1000 times and tested several alternatives.

The 'all in one loop' approach reads and parses the lines together:

In [159]: with open('stack48307975.txt','rb') as f:
     ...:     data=[np.array(list(l.strip()))-65 for l in f]
     ...:        
In [160]: len(data)
Out[160]: 4000
In [161]: data[:3]
Out[161]: 
[array([ 0, 15, 14, 17, 17,  0,  3,  0, 18,  3]),
 array([ 0, 18,  3, 18,  3,  0, 18,  3]),
 array([ 0, 15, 14, 17, 17,  0,  3,  0, 18,  3])]

This saves the genfromtxt overhead.

It is roughly equivalent to reading everything first and then parsing:

In [162]: with open('stack48307975.txt','rb') as f:
     ...:     dataB = f.readlines()
     ...:    
In [163]: len(dataB)
Out[163]: 4000
In [164]: [[c-65 for c in row.strip()] for row in dataB][:3]
Out[164]: 
[[0, 15, 14, 17, 17, 0, 3, 0, 18, 3],
 [0, 18, 3, 18, 3, 0, 18, 3],
 [0, 15, 14, 17, 17, 0, 3, 0, 18, 3]]

In Py3 I have to open the file in rb mode to get bytestrings, which then lets us convert with c-65. If I load in r mode, I get unicode strings and have to use ord(c).
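
A minimal illustration of that difference, using in-memory samples rather than a file (the variable names are made up for this sketch):

```python
line_bytes = b"APORRADASD\n"   # what iterating a file opened with 'rb' yields
line_text = "APORRADASD\n"     # what iterating a file opened with 'r' yields

# Iterating over bytes gives integers in Python 3, so c - 65 works directly.
from_bytes = [c - 65 for c in line_bytes.strip()]

# Iterating over str gives one-character strings, so ord() is needed first.
from_text = [ord(c) - 65 for c in line_text.strip()]

print(from_bytes)
print(from_text)
```

Both lists should hold the same values; only the conversion step differs.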

I can read the whole file into one string and convert it to an array:

In [165]: with open('stack48307975.txt','rb') as f:
     ...:     dataAll=f.read()
     ...:         
In [166]: len(dataAll)
Out[166]: 40000
In [167]: dataAll[:13]
Out[167]: b'APORRADASD\nAS'
In [168]: dataOne =np.fromiter(dataAll,dtype='uint8')
In [169]: dataOne.shape
Out[169]: (40000,)
In [170]: dataOne[:13]
Out[170]: array([65, 80, 79, 82, 82, 65, 68, 65, 83, 68, 10, 65, 83], dtype=uint8)
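
As a possible alternative to np.fromiter, np.frombuffer can interpret a bytes object as a uint8 array without a Python-level loop. This is a sketch on a small in-memory sample standing in for dataAll; the resulting array is read-only, so it would need a copy before any in-place modification:

```python
import numpy as np

sample = b"APORRADASD\nASDSDASD\n"  # stand-in for the bytes read from the file

# Interpret the raw bytes as unsigned 8-bit integers, with no per-byte iteration.
arr = np.frombuffer(sample, dtype=np.uint8)

# The array shares memory with the immutable bytes object, so it is read-only.
print(arr[:13])
print(arr.flags.writeable)
```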

But splitting it into lines takes a bit more work.

In [171]: idx = np.where(dataOne==10)[0]
In [172]: idx
Out[172]: array([   10,    19,    30, ..., 39979, 39990, 39999], dtype=int32)

In [175]: np.split(dataOne-65, idx)[:3]
Out[175]: 
[array([ 0, 15, 14, 17, 17,  0,  3,  0, 18,  3], dtype=uint8),
 array([201,   0,  18,   3,  18,   3,   0,  18,   3], dtype=uint8),
 array([201,   0,  15,  14,  17,  17,   0,   3,   0,  18,   3], dtype=uint8)]

The 201 values above are the \n bytes: 10-65 wraps around to 201 in uint8. Handling the \n better:

In [178]: idx0=[0]+(idx[:-1]+1).tolist()
In [179]: [dataOne[i:j]-65 for i,j in zip(idx0, idx)][:4]
Out[179]: 
[array([ 0, 15, 14, 17, 17,  0,  3,  0, 18,  3], dtype=uint8),
 array([ 0, 18,  3, 18,  3,  0, 18,  3], dtype=uint8),
 array([ 0, 15, 14, 17, 17,  0,  3,  0, 18,  3], dtype=uint8),
 array([ 0, 18,  3, 18,  3,  0, 18,  3], dtype=uint8)]
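
Putting these steps together, a self-contained sketch might look like the following. The function name split_lines_minus_65 is made up for illustration, np.frombuffer is used in place of np.fromiter (a substitution, not what the session above ran), and the code assumes every line, including the last one, ends with \n:

```python
import numpy as np

def split_lines_minus_65(raw):
    """Split newline-terminated bytes into per-line uint8 arrays, minus 65."""
    codes = np.frombuffer(raw, dtype=np.uint8)
    ends = np.flatnonzero(codes == 10)             # position of each '\n'
    starts = np.concatenate(([0], ends[:-1] + 1))  # first byte after the previous '\n'
    # Slicing start:end excludes the newline itself; the subtraction makes a new array.
    return [codes[i:j] - 65 for i, j in zip(starts, ends)]

rows = split_lines_minus_65(b"APORRADASD\nASDSDASD\n")
print(rows)
```

If the file might lack a trailing newline, the last line would be dropped by this version, so that case would need separate handling.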

I've done some timings, but it would be more meaningful if you did your own.
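
For running such timings, a rough harness along these lines could be a starting point. It builds a synthetic in-memory sample instead of reading the real file, and the function names, sample size, and repeat count are arbitrary choices for this sketch; results will depend on the machine and on the actual data:

```python
import timeit
import numpy as np

raw = b"APORRADASD\nASDSDASD\n" * 2000   # synthetic stand-in for the file contents

def per_line(raw):
    # Parse line by line, building one small array per line.
    return [np.array(list(l)) - 65 for l in raw.splitlines()]

def whole_buffer(raw):
    # Convert everything once, then slice between newline positions.
    codes = np.frombuffer(raw, dtype=np.uint8)
    ends = np.flatnonzero(codes == 10)
    starts = np.concatenate(([0], ends[:-1] + 1))
    return [codes[i:j] - 65 for i, j in zip(starts, ends)]

for fn in (per_line, whole_buffer):
    t = timeit.timeit(lambda: fn(raw), number=5)
    print(f"{fn.__name__}: {t / 5:.4f} s per call")
```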