Question

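I have a data array in which each row is one data sample (5 samples) and each column is one feature of the data (6 features per sample).

I am trying to quantify the number of states each column contains and then map those states to a set of numbers. This should only be done if the column is not already numeric.

It is easier to explain with an example:

Example input (the input type is numpy.ndarray):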

In = array([['x', 's', 3, 'k', 's', 'u'],
            ['x', 's', 2, 'n', 'n', 'g'],
            ['b', 's', 0, 'n', 'n', 'm'],
            ['k', 'y', 1, 'w', 'v', 'l'],
            ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

For the first column:

curr_column = 0
colset = set()
for row in In:
    curr_element = row[curr_column]
    if curr_element not in colset:
        colset.add(curr_element)

#now colset = {'x', 'b', 'k'} so 3 possible states
collist = list(colset) #make it indexable
coldict = {}
for i in range(len(collist)):
    coldict[collist[i]] = i

This produces a dictionary, so I can now rewrite the original data (assuming coldict = {'x': 0, 'b': 1, 'k': 2}):

for i in range(len(In)): #loop over each row
    curr_element = In[i][curr_column] #get current element
    In[i][curr_column] = coldict[curr_element] #use it to find the numerical value
'''
now
In = array([[0, 's', 3, 'k', 's', 'u'],
            [0, 's', 2, 'n', 'n', 'g'],
            [1, 's', 0, 'n', 'n', 'm'],
            [2, 'y', 1, 'w', 'v', 'l'],
            [0, 's', 2, 'o', 'c', 'l']], dtype=object)
'''

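Now repeat this process for every column.

I know I could speed this up by populating all the column dictionaries in a single pass over the dataset and then replacing all the values in one loop. I left that out for clarity.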
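For reference, the single-pass variant described above could look roughly like this. It is only a sketch: `data` and `col_maps` are illustrative names, and a column is treated as numeric simply because its entries are Python ints or floats.

```python
import numpy as np

data = np.array([['x', 's', 3, 'k', 's', 'u'],
                 ['x', 's', 2, 'n', 'n', 'g'],
                 ['b', 's', 0, 'n', 'n', 'm'],
                 ['k', 'y', 1, 'w', 'v', 'l'],
                 ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

n_cols = data.shape[1]
# One dictionary per column, all filled in a single pass over the rows.
col_maps = [{} for _ in range(n_cols)]
for row in data:
    for j, value in enumerate(row):
        if not isinstance(value, (int, float)):
            # The first time a value is seen it gets the next free code.
            col_maps[j].setdefault(value, len(col_maps[j]))

# Second pass: replace every non-numeric entry with its code.
# Iterating a 2-D array yields row views, so this edits data in place.
for row in data:
    for j, value in enumerate(row):
        if value in col_maps[j]:
            row[j] = col_maps[j][value]
```

Column 2 is left untouched because its dictionary stays empty.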

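This is very inefficient in both space and time, and it takes a long time on large data. How can this algorithm be improved? Is there a mapping function in numpy or pandas that does this, or that would at least help?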

I was thinking of something like

np.unique(Input, axis=1)

but I need this to be portable, and not everyone has the 1.13.0 development version of numpy.
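One possibility that does not rely on the axis argument is calling np.unique with return_inverse=True on one column at a time. The sketch below assumes the column holds only strings; note that the codes follow sorted order, not the order in which values first appear.

```python
import numpy as np

# First column of the example input
column = np.array(['x', 'x', 'b', 'k', 'x'], dtype=object)

# uniques is sorted; codes[i] is the position of column[i] within uniques
uniques, codes = np.unique(column, return_inverse=True)

print(uniques)  # ['b' 'k' 'x']
print(codes)    # [2 2 0 1 2]
```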

Also, how would I distinguish numeric columns from non-numeric ones, so that I know which columns to apply this to?

Answer 1

Pandas also has a map function that you can use. For example, if you have this dictionary mapping strings to codes:

codes = {'x':0, 'b':1, 'k':2}

you can use the map function to map a column of a pandas DataFrame:

df[col] = df[col].map(codes)
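A small self-contained version of the same idea, using the first column of the question's data (the column name 'col' is just a placeholder):

```python
import pandas as pd

df = pd.DataFrame({'col': ['x', 'x', 'b', 'k', 'x']})
codes = {'x': 0, 'b': 1, 'k': 2}

# Any value that is missing from the dictionary would become NaN
df['col'] = df['col'].map(codes)

print(df['col'].tolist())  # [0, 0, 1, 2, 0]
```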

Answer 2

You can use Categorical codes. See the Categorical section of the docs.

In [11]: df
Out[11]:
   0  1  2  3  4  5
0  x  s  3  k  s  u
1  x  s  2  n  n  g
2  b  s  0  n  n  m
3  k  y  1  w  v  l
4  x  s  2  o  c  l

In [12]: for col in df.columns:
     ...:     df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes

In [13]: df
Out[13]:
   0  1  2  3  4  5
0  0  0  0  0  0  0
1  0  0  1  1  1  1
2  1  0  2  1  1  2
3  2  1  3  2  2  3
4  0  0  1  3  3  3

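I suspect there is a small tweak that allows this without passing the categories explicitly (note: pandas does guarantee that .unique() returns values in order of first appearance).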
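One function that may fit here is pd.factorize, which returns integer codes in order of first appearance without any categories being passed. A sketch on the question's data, with the DataFrame built from plain lists for illustration:

```python
import pandas as pd

df = pd.DataFrame([['x', 's', 3, 'k', 's', 'u'],
                   ['x', 's', 2, 'n', 'n', 'g'],
                   ['b', 's', 0, 'n', 'n', 'm'],
                   ['k', 'y', 1, 'w', 'v', 'l'],
                   ['x', 's', 2, 'o', 'c', 'l']])

for col in df.columns:
    # factorize returns (codes, uniques); only the codes are kept here
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
```

Like the loop above, this also recodes the numeric column, so it would need to be restricted to the non-numeric columns if those are to be left alone.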

Note: to "distinguish numeric columns from those that aren't", you can use select_dtypes before iterating:

for col in df.select_dtypes(exclude=['int']).columns:
    ...
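One caveat, offered as a sketch and not as part of the original answer: a DataFrame built directly from a dtype=object array keeps every column as object, so the numeric column would not be excluded. Calling infer_objects() first lets pandas recover the integer dtype. The two-row array below is a shortened, illustrative version of the question's data.

```python
import numpy as np
import pandas as pd

data = np.array([['x', 's', 3],
                 ['b', 'y', 0]], dtype=object)

# infer_objects() turns the all-integer object column into int64
df = pd.DataFrame(data).infer_objects()

# Only the non-numeric columns are recoded
for col in df.select_dtypes(exclude=['number']).columns:
    df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes
```

Here columns 0 and 1 are replaced by codes and column 2 keeps its original values.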

Mapping string categories to numbers using pandas and numpy
