Question

I have a pandas DataFrame with one column that contains a string of bits, e.g. '100100101'. I want to convert this string to a numpy array.

How can I do that?

Edit:

Using

features = df.bit.apply(lambda x: np.array(list(map(int,list(x)))))
#...
model.fit(features, labels)

leads to an error in model.fit:

ValueError: setting an array element with a sequence.

Thanks to the clear answers, I came up with a solution that works for my case:

featureList = []
for bitString in input_table['Bitstring'].values:
    # list() is required on Python 3, where map returns a lazy iterator
    bits = np.array(list(map(int, bitString)))
    featureList.append(bits)
features = np.array(featureList)
#....
model.fit(features, labels)

Answer 1

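For a string s = "100100101", you can convert it to a numpy array in at least two different ways.

The first uses numpy's fromstring method (in current NumPy, np.frombuffer on the encoded string, since the binary mode of fromstring is deprecated). It's a bit awkward, because you have to specify the datatype and subtract out the "base" value of the elements.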

import numpy as np

s = "100100101"
# np.frombuffer replaces np.fromstring, whose binary mode is deprecated
a = np.frombuffer(s.encode('ascii'), 'u1') - ord('0')

print(a)  # [1 0 0 1 0 0 1 0 1]

Here 'u1' is the datatype (unsigned 8-bit integer), and ord('0') is used to subtract the "base" value from each element.
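To make the role of ord('0') concrete, here is a minimal sketch that prints the intermediate values. It assumes a pure-ASCII bitstring and uses np.frombuffer on the encoded bytes; the variable names raw and bits are only illustrative.

```python
import numpy as np

s = "100100101"

# Each character is stored as its ASCII code: '0' -> 48, '1' -> 49.
raw = np.frombuffer(s.encode("ascii"), dtype="u1")
print(raw)   # [49 48 48 49 48 48 49 48 49]

# Subtracting ord('0') == 48 shifts the codes down to the digits 0 and 1.
bits = raw - ord("0")
print(bits)  # [1 0 0 1 0 0 1 0 1]
```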

The second way is to convert each string element to an integer (since strings are iterable), then pass that list to np.array:

import numpy as np

s = "100100101"
b = np.array(list(map(int, s)))  # list() is needed on Python 3, where map is lazy

print(b)  # [1 0 0 1 0 0 1 0 1]

Then:

# To see it's a numpy array:
print(type(a))  # <class 'numpy.ndarray'>
print(a[0])     # 1
print(a[1])     # 0
# ...

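Note that the second approach scales significantly worse than the first as the length of the input string s increases. For small strings the two are close, but consider the timeit results for a 90-character string (I just used s * 10):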

fromstring: 49.283392424 s
map/array:   2.154540959 s

(This uses the default timeit.repeat arguments, taking the minimum of 3 runs, each run timing 1M string->array conversions.)
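For anyone who wants to repeat the comparison on their own machine, a rough sketch follows. The helper names via_buffer and via_map are made up for this example, np.frombuffer stands in for the fromstring call, and the call count is reduced from timeit's default so the script finishes quickly. Absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

s = "100100101" * 10  # 90-character bitstring, as in the timing above

def via_buffer(bitstring):
    # Reinterpret the ASCII bytes and subtract the code of '0'.
    return np.frombuffer(bitstring.encode("ascii"), dtype="u1") - ord("0")

def via_map(bitstring):
    # Convert each character with int() and build the array from a list.
    return np.array(list(map(int, bitstring)))

# Both approaches should agree before their timings are compared.
assert np.array_equal(via_buffer(s), via_map(s))

n_calls = 10_000  # assumption: far fewer than timeit's default of 1,000,000
for name, func in [("frombuffer", via_buffer), ("map/array", via_map)]:
    best = min(timeit.repeat(lambda: func(s), repeat=3, number=n_calls))
    print(f"{name}: {best:.4f} s for {n_calls} calls")
```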

Answer 2

A pandas approach is to call apply on the df column to perform the conversion:

In [84]:

df = pd.DataFrame({'bit':['100100101']})
t = df.bit.apply(lambda x: np.array(list(map(int,list(x)))))
t[0]
Out[84]:
array([1, 0, 0, 1, 0, 0, 1, 0, 1])
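The result of apply is a Series of object dtype whose elements are separate arrays, which is the likely cause of the "setting an array element with a sequence" error when it is passed straight to model.fit. A minimal sketch of one way to get a regular 2-D array instead, assuming every bitstring has the same length (the two-row frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; all bitstrings must have the same length.
df = pd.DataFrame({"bit": ["100100101", "011011010"]})

t = df.bit.apply(lambda x: np.array(list(map(int, x))))
print(t.dtype)         # object

# Stack the per-row arrays into one (n_rows, n_bits) integer matrix.
features = np.stack(t.tolist())
print(features.shape)  # (2, 9)
```

The stacked features array has one row per sample and one column per bit, which is the shape estimators with a fit(X, y) interface generally expect.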

Answer 3

Check out unpackbits:

>>> np.unpackbits(np.array([int('010101',2)], dtype=np.uint8))
array([0, 0, 0, 1, 0, 1, 0, 1], dtype=uint8)

More generally:

>>> a = np.array([[2], [7], [23]], dtype=np.uint8)
>>> a
array([[ 2],
       [ 7],
       [23]], dtype=uint8)
>>> b = np.unpackbits(a, axis=1)
>>> b
array([[0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 0, 1, 1, 1]], dtype=uint8)

If you need more than 8 bits, check out "How to extract the bits of larger numeric Numpy data types".
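For a bitstring longer than 8 characters, such as the 9-bit example from the question, one possible sketch is to split the integer into bytes first and unpack those. This is an illustration under the assumption of a non-empty string of '0' and '1' characters, not necessarily the method described in the linked question.

```python
import numpy as np

s = "100100101"              # 9 bits, so a single uint8 is not enough
n_bits = len(s)
n_bytes = (n_bits + 7) // 8  # round up to whole bytes

# Pack the integer value into big-endian bytes, most significant byte first.
raw = int(s, 2).to_bytes(n_bytes, byteorder="big")

# unpackbits yields 8 bits per byte; keep only the last n_bits to drop the padding.
bits = np.unpackbits(np.frombuffer(raw, dtype=np.uint8))[-n_bits:]
print(bits)                  # [1 0 0 1 0 0 1 0 1]
```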

Convert a bitstring (a string of 1s and 0s) to a numpy array

3 answers: