Question

Is it possible to replace the string values in the columns of a 2D array with ordered numbers in Python?

For example, say you have a 2D array:

a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
a
Out[57]: 
array([['A', '0', 'C'],
       ['A', '0.3', 'B'],
       ['D', '1', 'D']], dtype='<U3')

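If I want to replace the string values 'A', 'A', 'D' in the first column with the numbers 0, 0, 1, and 'C', 'B', 'D' with 0, 1, 2, is there an efficient way to do it?

It may be helpful to know the following:

The replacement numbers in different columns are independent of each other. That is, every column in which strings are replaced with numbers starts at 0 and increases up to the number of unique values in that column minus one.
The above is a test case; the real data is much larger, with more columns of strings.

Here is an example approach to this problem that I quickly came up with: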

for j in range(a.shape[1]):
    b = list(set(a[:,j]))
    for i in range(len(b)):
        indices = np.where(a[:,j]==b[i])[0]
        print(indices)
        a[indices,j]=i

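However, this seems like an inefficient way to achieve this. It also does not distinguish between float and string values in the columns, and by default it replaces the values with numeric strings: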

a
Out[91]: 
array([['1.0', '0.0', '2.0'],
       ['1.0', '1.0', '0.0'],
       ['0.0', '2.0', '1.0']], dtype='<U3')

Any help on this would be greatly appreciated!

Answer 1

It looks like you are trying to do label encoding.

I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.

Using `LabelEncoder`

import numpy as np
from sklearn.preprocessing import LabelEncoder

# np.int was removed in NumPy 1.24; the builtin int is the replacement
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    b[:, column] = LabelEncoder().fit_transform(a[:, column])

b will then be:

array([[0, 0, 1],
       [0, 1, 0],
       [1, 2, 2]])

If you want to be able to go back to the original values, you need to save the encoders. You can do it like this:

from sklearn.preprocessing import LabelEncoder

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

Now encoders[0].classes_ will contain:

array(['A', 'D'], dtype='<U3')

This means that 'A' was mapped to 0 and 'D' to 1.
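As a rough sketch of how the saved encoders can be used to go back to the original values, the snippet below calls each encoder's inverse_transform on its own column. The sample array is written with explicit strings, and the name restored is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data with every value written as a string (an assumption for this sketch)
a = np.array([['A', '0', 'C'], ['A', '0.3', 'B'], ['D', '1', 'D']])

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

# Decode each column with the encoder that was fitted on it
restored = np.column_stack(
    [encoders[column].inverse_transform(b[:, column]) for column in range(b.shape[1])]
)
```

Because each encoder is fitted on a single column, it has to be paired with that same column when decoding.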

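Finally, if you overwrite a with the encoded values instead of using a new matrix b, you will get the integers as strings ("1" instead of 1). You can fix this with astype(int):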

encoders = {}
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    a[:, column] = encoders[column].fit_transform(a[:, column])

# At this point, a will have strings instead of ints because a had type str
# array([['0', '0', '1'],
#       ['0', '1', '0'],
#       ['1', '2', '2']], dtype='<U3')

a = a.astype(int)

# Now `a` is of type int
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])

Using `pd.factorize`

factorize returns both the encoded column and the encoding mapping, so if you do not need the mapping you can skip saving it:

import pandas as pd

for column in range(a.shape[1]):
    a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

a = a.astype(int) # same as above, it's of type str
# factorize numbers values in order of first appearance, so a is
# array([[0, 0, 0],
#        [0, 1, 1],
#        [1, 2, 2]])

If you want to keep the encoding mappings:

mappings = []
for column in range(a.shape[1]):
    a[:, column], mapping = pd.factorize(a[:, column])
    mappings.append(mapping)

a = a.astype(int)

Now mappings[0] will contain the following data:

array(['A', 'D'], dtype=object)

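This has the same semantics as encoders[0].classes_ in the sklearn LabelEncoder solution, except that factorize orders the values by first appearance rather than sorting them (pass sort=True to get sorted order).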
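To make the ordering difference concrete, here is a small sketch on the third column of the example. It compares the default first-appearance codes with sort=True, and shows that indexing the mapping with the codes recovers the original values. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

col = np.array(['C', 'B', 'D'])

# Default: codes follow the order in which values first appear
codes, uniques = pd.factorize(col)

# sort=True: codes follow the sorted order of the unique values,
# which matches what LabelEncoder produces
sorted_codes, sorted_uniques = pd.factorize(col, sort=True)

# Indexing the mapping with the codes gives back the original column
decoded = uniques[codes]
```

The default ordering happens to give the 0, 1, 2 numbering for 'C', 'B', 'D' that the question asks for.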

Answer 2

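You can do what you want efficiently with NumPy alone.

Basically, you iterate over the values in each column of the input while keeping track of the letters already seen in a set or dict. This is similar to what you already have, but slightly more efficient (for one thing, it avoids calling np.where).

Here is a function, charToIx, that does what you want: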

import numpy as np
from collections import defaultdict
from string import ascii_letters

class Ix:
    """Counter that returns 0, 1, 2, ... on successive calls."""
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    # Walk the columns of the input and output side by side
    for incol,outcol in zip(arr.T, out.T):
        ix = Ix()
        # A new key gets the next unused index for this column
        cixDict = defaultdict(lambda: ix())
        for i,x in enumerate(incol):
            # Letters are replaced by their index; other values are copied
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out

You specify the type of the output array when you call the function. So the output of:

a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
print(charToIx(a, dtype=float))

will be a float array:

array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])
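For comparison, if every column should be encoded (as in the LabelEncoder answer) rather than only the letter columns, np.unique with return_inverse=True gives the codes directly. This is a minimal sketch, not part of the answer above: encode_columns is a hypothetical helper name, and the codes follow sorted order rather than first appearance.

```python
import numpy as np

def encode_columns(arr):
    """Encode each column of a 2D array as integer codes (hypothetical helper)."""
    out = np.empty(arr.shape, dtype=int)
    uniques = []
    for j in range(arr.shape[1]):
        # values: sorted unique entries; codes: index of each entry in values
        values, codes = np.unique(arr[:, j], return_inverse=True)
        out[:, j] = codes
        uniques.append(values)
    return out, uniques

# Sample data with every value written as a string (an assumption for this sketch)
a = np.array([['A', '0', 'C'], ['A', '0.3', 'B'], ['D', '1', 'D']])
out, uniques = encode_columns(a)

# uniques[j][out[:, j]] recovers column j of the original array
first_column = uniques[0][out[:, 0]]
```

Unlike charToIx, this also replaces the numeric column with codes, so it only fits the case where that is acceptable.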
