Question

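pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? And what about the reverse mapping step?

Example: this data frame contains columns with string values such as "type 2", which I would like to convert to numerical values, and possibly translate back later.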

Answer 1

You can use apply if you need to factorize each column separately:

import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
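The per-column approach above throws away the second return value of pd.factorize, which is what a reverse mapping needs. A minimal sketch of keeping it, assuming no missing values (factorize marks NaN with the code -1); the names encoded, uniques_by_col and decoded are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Illustrative names: `encoded` holds the codes, `uniques_by_col` the lookup tables.
encoded = pd.DataFrame(index=df.index)
uniques_by_col = {}
for col in df.columns:
    codes, uniques = pd.factorize(df[col])
    encoded[col] = codes
    # Distinct values of this column, in order of first appearance.
    uniques_by_col[col] = uniques

# Decode: code i in a column refers to position i in that column's uniques.
# Assumes no -1 codes, i.e. no missing values in the input.
decoded = pd.DataFrame(
    {col: uniques_by_col[col].take(encoded[col].to_numpy()).to_numpy()
     for col in encoded.columns},
    index=encoded.index,
)
print(decoded)
```

Each column has its own lookup table here, so the same code can stand for different strings in different columns.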

If you need the same string value to map to the same number in every column:

print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

If you need to apply the function to only some columns, use a subset:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

Solution with factorize:

stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
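The stacked factorize call can also serve the reverse step, because its second return value lists the distinct strings in code order. A rough sketch starting from a fresh copy of the example frame; the variable names are illustrative, and missing values are assumed to be absent:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# One shared code table for columns B and C.
stacked = df[['B', 'C']].stack()
codes, uniques = stacked.factorize()
encoded = pd.Series(codes, index=stacked.index).unstack()

# Reverse mapping: code -> original string, built from the uniques.
code_to_value = dict(enumerate(uniques))
decoded = encoded.stack().map(code_to_value).unstack()
print(encoded)
print(decoded)
```

Because both columns share one table, a given string gets the same code wherever it appears.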

Converting them back is possible with map and a dict, for which you need to remove duplicates with drop_duplicates:

vals = df.stack().drop_duplicates().values
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

Answer 2

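I also found this answer quite helpful: https://stackoverflow.com/a/20051631/4643212

I was trying to take values from an existing column in a Pandas DataFrame (a list of IP addresses named 'SrcIP') and map them to numerical values in a new column (named 'ID' in this example).

Solution: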

df['ID'] = pd.factorize(df.SrcIP)[0]

Result:

        SrcIP | ID    
192.168.1.112 |  0  
192.168.1.112 |  0  
192.168.4.118 |  1 
192.168.1.112 |  0
192.168.4.118 |  1
192.168.5.122 |  2
192.168.5.122 |  2
...

Answer 3

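I would like to redirect to my answer: https://stackoverflow.com/a/32011969/1694714

Old answer

Another readable solution to this problem, when you want to keep the categories consistent across the resulting DataFrame, is to use replace: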

def categorise(df):
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    return df.replace(categories)

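This performs slightly worse than @jezrael's example, but it is easier to read. It might also scale better for bigger datasets. I can do some proper testing if anyone is interested.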
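For the reverse step, the same dictionary can be inverted. The sketch below is a variation on categorise, not the code from the answer: it swaps df.replace for a column-wise Series.map, returns the mapping alongside the encoded frame, and adds a hypothetical uncategorise helper.

```python
import pandas as pd

def categorise(df):
    # value -> code, shared across all columns (order of first appearance).
    categories = {value: code
                  for code, value in enumerate(df.stack().unique())}
    # Swapped-in technique: Series.map per column instead of df.replace.
    encoded = df.apply(lambda col: col.map(categories))
    return encoded, categories

def uncategorise(encoded, categories):
    # Hypothetical helper: invert the mapping and apply it column by column.
    inverse = {code: value for value, code in categories.items()}
    return encoded.apply(lambda col: col.map(inverse))

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

encoded, categories = categorise(df)
restored = uncategorise(encoded, categories)
print(encoded)
print(restored)
```

Note that map turns any value missing from the dictionary into NaN, whereas replace would leave it unchanged, so the two behave differently on unseen values.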

pandas.factorize on an entire data frame
