How can I read a file whose lines of text may differ in length as arrays of ASCII codes?
From
APORRADASD
ASDSDASD
to
0 -> [065,080,079,082,082,065,068,065,083,068]
1 -> [065,083,068,083,068,065,083,068]
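and then subtract 65 from each value (only uppercase letters occur). When I read each line in a for loop and then subtract, the process is very slow (the file is about 300+ MB).
import numpy as np

data = np.genfromtxt('data.txt', dtype=np.object_)
for x in range(0, data.shape[0]):
    # each entry is a bytes object, so iterating over it yields ints
    data[x] = [c - 65 for c in data[x]]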
Answer 0 (score: 0)
Try this:
ascii_conv = []
foo = "APORRADASD"
bar = list(foo)
for n in bar:
    # ord() gives the ASCII code of a single character
    ascii_conv.append(ord(n))
Answer 1 (score: 0)
I copied your sample 1000 times and tested several options.
'All in one loop' reads and parses the lines together:
In [159]: with open('stack48307975.txt','rb') as f:
     ...:     data=[np.array(list(l.strip()))-65 for l in f]
     ...:
In [160]: len(data)
Out[160]: 4000
In [161]: data[:3]
Out[161]:
[array([ 0, 15, 14, 17, 17, 0, 3, 0, 18, 3]),
array([ 0, 18, 3, 18, 3, 0, 18, 3]),
array([ 0, 15, 14, 17, 17, 0, 3, 0, 18, 3])]
This saves the genfromtxt overhead.
Read everything first, then parse:
In [162]: with open('stack48307975.txt','rb') as f:
     ...:     dataB = f.readlines()
     ...:
In [163]: len(dataB)
Out[163]: 4000
In [164]: [[c-65 for c in row.strip()] for row in dataB][:3]
Out[164]:
[[0, 15, 14, 17, 17, 0, 3, 0, 18, 3],
[0, 18, 3, 18, 3, 0, 18, 3],
[0, 15, 14, 17, 17, 0, 3, 0, 18, 3]]
In Py3 I have to open the file in rb mode to get bytestrings, which lets us convert with c-65 directly. If I load it in r mode, I get unicode strings and have to use ord(c).
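To illustrate the difference between the two modes, here is a minimal sketch that uses an in-memory bytes value and its decoded str counterpart instead of a real file. The sample text comes from the question; the variable names are only illustrative.

```python
raw = b"APORRADASD"          # what a line looks like when the file is opened with 'rb'
text = raw.decode("ascii")   # what the same line looks like when opened with 'r'

# Iterating over bytes yields ints in Python 3, so plain arithmetic works.
from_bytes = [c - 65 for c in raw]

# Iterating over str yields 1-character strings, so ord() is needed first.
from_text = [ord(c) - 65 for c in text]

print(from_bytes)
print(from_text)
```

Both lists hold the same values; only the conversion step differs.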
I could read the whole file into one string and convert it to an array:
In [165]: with open('stack48307975.txt','rb') as f:
     ...:     dataAll=f.read()
     ...:
In [166]: len(dataAll)
Out[166]: 40000
In [167]: dataAll[:13]
Out[167]: b'APORRADASD\nAS'
In [168]: dataOne =np.fromiter(dataAll,dtype='uint8')
In [169]: dataOne.shape
Out[169]: (40000,)
In [170]: dataOne[:13]
Out[170]: array([65, 80, 79, 82, 82, 65, 68, 65, 83, 68, 10, 65, 83], dtype=uint8)
But splitting it into lines takes a bit more work.
In [171]: idx = np.where(dataOne==10)[0]
In [172]: idx
Out[172]: array([ 10, 19, 30, ..., 39979, 39990, 39999], dtype=int32)
In [175]: np.split(dataOne-65, idx)[:3]
Out[175]:
[array([ 0, 15, 14, 17, 17, 0, 3, 0, 18, 3], dtype=uint8),
array([201, 0, 18, 3, 18, 3, 0, 18, 3], dtype=uint8),
array([201, 0, 15, 14, 17, 17, 0, 3, 0, 18, 3], dtype=uint8)]
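The 201 at the start of the second and third arrays is the newline byte. np.split cuts before each index, so every chunk after the first begins with the \n (value 10), and 10 - 65 wraps around to 201 in uint8. A small sketch on a made-up two-line buffer that reproduces this (it uses np.frombuffer rather than np.fromiter, purely for brevity):

```python
import numpy as np

# Hypothetical two-line buffer standing in for the file contents.
buf = np.frombuffer(b"AB\nCD\n", dtype=np.uint8)
idx = np.where(buf == 10)[0]          # positions of the newline bytes

chunks = np.split(buf - 65, idx)      # split *before* each newline
first, second = chunks[0], chunks[1]
print(first)    # letters of the first line only
print(second)   # begins with the wrapped newline value
```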
Or, handling the \n properly:
In [178]: idx0=[0]+(idx[:-1]+1).tolist()
In [179]: [dataOne[i:j]-65 for i,j in zip(idx0, idx)][:4]
Out[179]:
[array([ 0, 15, 14, 17, 17, 0, 3, 0, 18, 3], dtype=uint8),
array([ 0, 18, 3, 18, 3, 0, 18, 3], dtype=uint8),
array([ 0, 15, 14, 17, 17, 0, 3, 0, 18, 3], dtype=uint8),
array([ 0, 18, 3, 18, 3, 0, 18, 3], dtype=uint8)]
I've done some timings, but they would be more meaningful if you ran your own.
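As a starting point for such timings, here is a rough sketch that compares a per-line parse with a whole-buffer parse. It runs on synthetic in-memory data rather than a file, and assumes newline-terminated, uppercase-only ASCII input. The function names per_line and whole_buffer are made up for this example, and np.frombuffer is swapped in for np.fromiter as an assumption that it avoids per-byte iteration.

```python
import timeit
import numpy as np

# Synthetic stand-in for the file contents, so no file on disk is needed.
raw = b"APORRADASD\nASDSDASD\n" * 1000

def per_line(raw):
    # Parse line by line, similar to the 'all in one loop' variant.
    return [np.array(list(line)) - 65 for line in raw.splitlines()]

def whole_buffer(raw):
    # Convert the whole buffer once, then slice between newline positions.
    arr = np.frombuffer(raw, dtype=np.uint8)
    ends = np.flatnonzero(arr == 10)
    starts = np.concatenate(([0], ends[:-1] + 1))
    shifted = arr - 65   # newline bytes wrap around, but the slices exclude them
    return [shifted[i:j] for i, j in zip(starts, ends)]

a = per_line(raw)
b = whole_buffer(raw)
# Check that both approaches agree before comparing their speed.
same = len(a) == len(b) and all(np.array_equal(x, y) for x, y in zip(a, b))

for fn in (per_line, whole_buffer):
    t = timeit.timeit(lambda: fn(raw), number=5)
    print(fn.__name__, t)
```

The numbers depend on the machine, the NumPy version and the line lengths, so they are best measured on a representative slice of the real file.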