Creating a 2d numpy array to hold characters

Time: 2017-06-08 20:31:58

Tags: python numpy

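I have the following numpy array and the setup that precedes it: a queue of words and a temporary variable 'temp' that stores one word. That word needs to be "placed" into the 2d numpy array letter by letter: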

from collections import deque
import numpy as np 
message=input("Write a message:")
wordqueue=message.split()
queue=deque(wordqueue)
print(wordqueue)

for i in range(1):
  temp=wordqueue.pop(0) #store the removed item in the temporary variable 'temp'
print(wordqueue)
print(temp)
display = np.zeros((4,10)) #create a 2d array that is to store the words from the queue
print(display)
display[0, 0] = temp #add the word from the temp variable to fill the array (each character in each sequential position in the array)
print(display)

Unfortunately, the output is as follows:

Write a message: This is a message for the display array
['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
['is', 'a', 'message', 'for', 'the', 'display', 'array']
This
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
Traceback (most recent call last):
  File "python", line 20, in <module>
ValueError: could not convert string to float: 'This'

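I did try defining the 2d array with an explicit data type, but how to do that was not obvious either, and I kept running into various errors.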

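What I would like help with is the following: 1. Ideally, I would like the numpy array to be initialized with "*" rather than zeros/ones (the documentation was no help with this). 2. Replace the *s in the array with the contents of the temp variable, one letter per *.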

Example

Display array:  (4 x 20)

* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *

Input message: this is a test message   temp: this

The updated display would then show:

t h i s * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *

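Subsequent words would continue to fill the array (truncating a word if it is too big, and moving to the next line if necessary).

What I have so far: https://repl.it/IcJ3/7

I tried this, for example, to create a char array: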

display = np.chararray((4,10)) #create a 2d array that is to store the letters in the words from the queue
display[:]="*"

But it produced this, with a stray "b" inserted. I can't see why...

[[b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*']
 [b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*']
 [b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*']
 [b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*' b'*']]

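Updated (work in progress) repl.it here: https://repl.it/IcJ3/8

2 answers:

Answer 0: (score: 1)

First, if you want an array of "characters", you have to be careful about what you expect. In Python 3, strings are sequences of unicode code points. In Python 2, strings are the classic "sequence of bytes" strings from C. This means that, from a memory point of view, the unicode type can take up considerably more memory: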

In [1]: import numpy as np

In [2]: chararray = np.zeros((4,10), dtype='S1')

In [3]: unicodearray =  np.zeros((4,10), dtype='U1')

In [4]: chararray.itemsize, unicodearray.itemsize
Out[4]: (1, 4)

In [5]: chararray.nbytes
Out[5]: 40

In [6]: unicodearray.nbytes
Out[6]: 160

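So, if you know you only need ASCII characters, you can cut memory use to 1/4 by using the S1 dtype. Also note that, because S1 in Python 3 corresponds to the bytes data type (the equivalent of Python 2 str), the representation is prefixed with b, as in b'this is a bytes object'. This is where the "b" in the chararray output comes from: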

In [7]: chararray
Out[7]:
array([[b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b'']],
      dtype='|S1')

In [8]: unicodearray
Out[8]:
array([['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', '']],
      dtype='<U1')
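
If the b prefix is only a problem when showing the array, one option is to keep the compact S1 storage and convert to a str array just for display. The following is a minimal sketch under that assumption; the names byte_display and text_display are illustrative and not part of the question's code.

```python
import numpy as np

# Compact storage: one byte per cell, repr shows a b'' prefix.
byte_display = np.full((4, 10), b'*', dtype='S1')

# Convert to a unicode array for display; ASCII bytes decode cleanly.
text_display = byte_display.astype('U1')

# Join each row into a plain Python string for printing.
rows = [''.join(row) for row in text_display]
print('\n'.join(rows))
```

The conversion makes a copy, so this only pays off if the array spends most of its life in the S1 form.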

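Now, suppose you have some payload, a message that you want to assign into the array. If your message consists of characters that can be represented as ASCII, you can play fast and loose with the dtype: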

In [15]: message = 'This'

In [16]: unicodearray.reshape(-1)[:len(message)] = list(message)

In [17]: unicodearray
Out[17]:
array([['T', 'h', 'i', 's', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', '']],
      dtype='<U1')

In [18]: chararray.reshape(-1)[:len(message)] = list(message)

In [19]: chararray
Out[19]:
array([[b'T', b'h', b'i', b's', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b'']],
      dtype='|S1')

However, if the message contains characters outside ASCII, only the unicode array can hold it:

In [22]: message = "กขฃคฅฆงจฉ"

In [23]: len(message)
Out[23]: 9

In [24]: unicodearray.reshape(-1)[:len(message)] = list(message)

In [25]: unicodearray
Out[25]:
array([['ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', '']],
      dtype='<U1')

In [26]: chararray.reshape(-1)[:len(message)] = list(message)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-7d7cdb93de1f> in <module>()
----> 1 chararray.reshape(-1)[:len(message)] = list(message)

UnicodeEncodeError: 'ascii' codec can't encode character '\u0e01' in position 0: ordinal not in range(128)

Note that if you want to initialize the array with an element other than what np.zeros gives you, you can use np.full:

In [27]: chararray = np.full((4,10), '*', dtype='S1')

In [28]: chararray
Out[28]:
array([[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*']],
      dtype='|S1')

And here is the long form, using for-loops:

In [17]: temp = "a test"

In [18]: display = np.full((4,10), '*', dtype='U1')

In [19]: display
Out[19]:
array([['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

In [20]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:

In [21]: display
Out[21]:
array([['a', ' ', 't', 'e', 's', 't', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

Another test for good measure, this time one that spans several rows:

In [36]: temp = "this is a test, a test this is"

In [37]: display = np.full((4,10), '*', dtype='U1')

In [38]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:

In [39]: display
Out[39]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' '],
       ['t', 'e', 's', 't', ',', ' ', 'a', ' ', 't', 'e'],
       ['s', 't', ' ', 't', 'h', 'i', 's', ' ', 'i', 's'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

Warning: the order of the arguments passed to zip matters, because the iterator named it is single-pass:

zip(range(display.shape[1]), it)

The iterator should be the last argument, otherwise characters will be skipped between rows!
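
To see why, here is a small standalone sketch (the variable names are illustrative). zip asks its arguments for items from left to right, so when the single-pass iterator comes first, zip pulls one more character from it before discovering that the range is exhausted, and that character is lost.

```python
letters = "abcdef"

# Range first: zip stops at the exhausted range before touching the iterator.
it = iter(letters)
first_ok = [c for _, c in zip(range(3), it)]
second_ok = [c for _, c in zip(range(3), it)]

# Iterator first: at the end of each pass zip consumes one extra character
# from the iterator and then discards it.
it = iter(letters)
first_bad = [c for c, _ in zip(it, range(3))]
second_bad = [c for c, _ in zip(it, range(3))]

print(first_ok, second_ok)    # ['a', 'b', 'c'] ['d', 'e', 'f']
print(first_bad, second_bad)  # ['a', 'b', 'c'] ['e', 'f']
```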

Finally, note that numpy provides a convenience function for iterating over an array sequentially:

In [49]: temp = "this is yet another test"

In [50]: display = np.full((4,10), '*', dtype='U1')

In [51]: for c, x in zip(temp, np.nditer(display, op_flags=['readwrite'])):
    ...:     x[...] = c
    ...:

In [52]: display
Out[52]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'y', 'e'],
       ['t', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', ' '],
       ['t', 'e', 's', 't', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

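There is one small wrinkle: you have to pass op_flags=['readwrite'] to the function so that the returned iterator allows the underlying array to be modified. In return the code becomes much simpler, and we no longer need a single-pass iterator. Still, I prefer slice assignment.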
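
For comparison, the slice-assignment version can be wrapped in a small helper. This is only a sketch: the function name write_text is made up for illustration, and it assumes that text which does not fit should simply be cut off at the end of the array.

```python
import numpy as np

def write_text(display, text):
    """Copy text into display in row-major order, dropping any overflow."""
    chars = list(text)[:display.size]   # truncate to the number of cells
    if chars:                           # nothing to do for an empty string
        display.flat[:len(chars)] = chars
    return display

display = np.full((4, 10), '*', dtype='U1')
write_text(display, "this is yet another test")
print(display)
```

The array is modified in place; returning it is just a convenience for chaining.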

Answer 1: (score: 0)

Going from a string, to a list, to an array with one word per element:

In [402]: astr = "This is a message for the display array"
In [403]: alist = astr.split()
In [404]: alist
Out[404]: ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
In [405]: arr = np.array(alist)
In [406]: arr
Out[406]: 
array(['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array'], 
      dtype='<U7')
In [407]: arr.shape
Out[407]: (8,)

I am using PY3, so the dtype is U7. np.array picks it automatically, just wide enough to hold the longest string in the list.
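
One consequence of the fixed width is worth keeping in mind: assigning a longer string to an element later truncates it silently. A short sketch, with illustrative variable names:

```python
import numpy as np

words = np.array(['This', 'is', 'a'])     # width inferred from 'This': U4
words[1] = 'message'                      # silently cut to four characters
print(words.dtype, words[1])

# Asking for a wider dtype up front leaves room for longer words.
wide = np.array(['This', 'is', 'a'], dtype='U10')
wide[1] = 'message'
print(wide.dtype, wide[1])
```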

For an array that holds single characters:

In [408]: carr = np.zeros((4,10), 'U1')
In [409]: carr
Out[409]: 
array([['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', ''],
       ['', '', '', '', '', '', '', '', '', '']], 
      dtype='<U1')
In [410]: carr.fill('*')
In [411]: carr
Out[411]: 
array([['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']], 
      dtype='<U1')

Creating an array of single characters from a string:

In [430]: np.array(list(astr))
Out[430]: 
array(['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'm', 'e', 's',
       's', 'a', 'g', 'e', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ',
       'd', 'i', 's', 'p', 'l', 'a', 'y', ' ', 'a', 'r', 'r', 'a', 'y'], 
      dtype='<U1')

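Mapping a list of words onto a single-character array is a bit more tedious: 'This' goes into arr[0,0:4], and so on.

Here is one way of mapping the list of words to an array: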

In [462]: alist
Out[462]: ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
In [463]: ''.join(alist)                     # back to one string
Out[463]: 'Thisisamessageforthedisplayarray'
In [464]: np.array(list(''.join(alist)))     # a flat array of char
Out[464]: 
array(['T', 'h', 'i', 's', 'i', 's', 'a', 'm', 'e', 's', 's', 'a', 'g',
       'e', 'f', 'o', 'r', 't', 'h', 'e', 'd', 'i', 's', 'p', 'l', 'a',
       'y', 'a', 'r', 'r', 'a', 'y'], 
      dtype='<U1')
In [465]: _.shape
Out[465]: (32,)

Or I can copy the list of characters into an existing array (using flat to treat it as 1d):

In [466]: arr = np.zeros((4,10), 'U1')
In [467]: arr.flat[:32] = list(''.join(alist))  
In [468]: arr
Out[468]: 
array([['T', 'h', 'i', 's', 'i', 's', 'a', 'm', 'e', 's'],
       ['s', 'a', 'g', 'e', 'f', 'o', 'r', 't', 'h', 'e'],
       ['d', 'i', 's', 'p', 'l', 'a', 'y', 'a', 'r', 'r'],
       ['a', 'y', '', '', '', '', '', '', '', '']], 
      dtype='<U1')

If I put spaces between the words:

In [471]: arr.flat[:39] = list(' '.join(alist))
In [472]: arr
Out[472]: 
array([['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' '],
       ['m', 'e', 's', 's', 'a', 'g', 'e', ' ', 'f', 'o'],
       ['r', ' ', 't', 'h', 'e', ' ', 'd', 'i', 's', 'p'],
       ['l', 'a', 'y', ' ', 'a', 'r', 'r', 'a', 'y', '']], 
      dtype='<U1')
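
The flat copy above breaks words across rows. The question also asked for words to move to the next line when they do not fit and to be truncated when they are wider than a row. Here is a rough sketch of that behaviour. The function name layout_words, the single fill cell left between words, and the choice to drop words once the rows run out are all assumptions of mine, not something the question specifies.

```python
import numpy as np

def layout_words(words, shape=(4, 10), fill='*'):
    """Place whole words row by row; wrap when a word does not fit."""
    rows, cols = shape
    display = np.full(shape, fill, dtype='U1')
    r, c = 0, 0
    for word in words:
        word = word[:cols]              # truncate words wider than a row
        if not word:
            continue
        if c + len(word) > cols:        # no room left: start the next row
            r, c = r + 1, 0
        if r >= rows:                   # out of rows: drop remaining words
            break
        display[r, c:c + len(word)] = list(word)
        c += len(word) + 1              # assumed: one fill cell as separator
    return display

alist = "This is a message for the display array".split()
print(layout_words(alist))
```

With a 4 x 10 display the last word, 'array', does not fit and is dropped; a larger shape, such as the 4 x 20 in the question, leaves room for it.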