
Question

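pd.get_dummies converts a categorical variable into dummy variables. Reconstructing the categorical variable from the dummies is trivial, but is there a preferred/quick way to do it?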

Answer 1

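It's been a few years, so this may well not have been in the pandas toolkit when the question was originally asked, but this approach seems a little easier to me. idxmax returns the index corresponding to the largest element (i.e. the one with a 1). We use axis=1 because we want the column name where the 1 occurs.

EDIT: I didn't bother making the result categorical instead of just a string, but you can do that the same way @Jeff did, by wrapping it with pd.Categorical (and pd.Series, if desired).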

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
Out[3]: 
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
Out[5]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
Out[7]: 
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True
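The categorical wrapping mentioned in the edit above could look roughly like this. It is a minimal sketch: passing the dummy columns as the categories is my own choice, made so that the category set does not depend on which labels happen to occur in the rows.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
# dtype=int keeps the 0/1 display used above (newer pandas defaults to bool)
dummies = pd.get_dummies(s, dtype=int)

# idxmax gives one column label per row; pd.Categorical turns those labels
# into a categorical whose categories are the dummy column names.
s2 = pd.Series(
    pd.Categorical(dummies.idxmax(axis=1), categories=dummies.columns),
    index=dummies.index,
)
```
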

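EDIT in response to @piRSquared's comment: this solution does indeed assume there is exactly one 1 per row. I think that is usually the format one has. pd.get_dummies can return rows that are all 0 if you set drop_first=True, or if there are NaN values and dummy_na=False (the default). (Any cases I'm missing?) A row of all zeros will be treated as an instance of the variable named in the first column (e.g. a in the example above).

If drop_first=True, you cannot tell from the dummies dataframe alone what the name of the "first" variable was, so the operation is not invertible unless you keep extra information around; I'd recommend leaving drop_first=False (the default).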
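If you do end up with drop_first=True, the inversion only works when the dropped label is stored somewhere. The following is a sketch under that assumption; first_label is a hypothetical variable holding the extra information, and it also assumes the data contains no NaNs.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# Assumption: for plain string data, get_dummies orders labels by sorted
# value, so the dropped "first" label is the smallest one. Keep it around.
first_label = sorted(s.unique())[0]

dummies = pd.get_dummies(s, drop_first=True, dtype=int)  # columns: b, c

# idxmax alone would map all-zero rows to the first remaining column ('b'),
# so rows without any 1 are overwritten with the stored label.
has_one = dummies.sum(axis=1) > 0
restored = dummies.idxmax(axis=1).where(has_one, first_label)
```
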

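Since dummy_na=False is the default, NaN values could certainly cause problems. Set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set it unless you actually have NaNs. A nice approach might be dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").

It's also worth mentioning that combining drop_first=True with dummy_na=False makes NaNs indistinguishable from instances of the first variable, so this is strongly discouraged if your dataset may contain any NaN values.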
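A small round trip with a missing value shows the dummy_na suggestion in action. This is only a sketch; the sample series is made up, and bool() and dtype=int are my additions for explicitness.

```python
import numpy as np
import pandas as pd

series = pd.Series(['a', 'b', np.nan, 'c'])

# Add the NaN indicator column only when the data actually contains NaNs.
dummies = pd.get_dummies(
    series, dummy_na=bool(series.isnull().any()), dtype=int
)

# The extra column is labelled with NaN itself, so idxmax hands back a real
# missing value for that row rather than the string "nan".
restored = dummies.idxmax(axis=1)
```
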

Answer 2

In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe get_categories(); see here
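For readers on a recent pandas: as far as I know, versions 1.5 and later ship pd.from_dummies, which fills roughly this role. A hedged sketch follows; it uses plain strings rather than a categorical series, and .iloc[:, 0] simply picks the single column of the returned DataFrame.

```python
import pandas as pd

s = pd.Series(list('aaabbbccddefgh'))
df = pd.get_dummies(s, dtype=int)

# Requires pandas >= 1.5. from_dummies returns a DataFrame, so take its
# only column to get a Series of labels back.
restored = pd.from_dummies(df).iloc[:, 0]
```

Unlike the idxmax approach, from_dummies raises an error when a row has no 1 or more than one 1, unless a default category is supplied for the all-zero case.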

Answer 3

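This is quite a late answer, but since you asked for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and obtain the same result. The idea is to index the column names at the positions where the dummy dataframe is not 0:

Method:

Using the same sample data as @Nathan: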

>>> dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])

>>> s2
0    a
1    b
2    a
3    c
dtype: object

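Benchmark:

On a small dummy dataframe, you won't see much difference in performance. However, here is a test of several strategies for solving this problem on a large series: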

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['a','b','c'], 10000))

dummies = pd.get_dummies(s)

def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)

import timeit

# Time each method, 1000 iterations each:

>>> timeit.timeit(np_method, number=1000)
1.0491090340074152

>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488

>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992

>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936

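The np.where method is about 4 times faster than the get_level_values method and about 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.

They all return the same result: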

>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True

Answer 4

Setup

Using @Jeff's setup

s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)

If columns are strings

and there is only one 1 per row

df.dot(df.columns)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object

`numpy.where`

Again! Assuming only one 1 per row

i, j = np.where(df)
pd.Series(df.columns[j], i)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]

`numpy.where`

Not assuming one 1 per row

i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))

0   0    a
1   0    a
2   0    a
3   1    b
4   1    b
5   1    b
6   2    c
7   2    c
8   3    d
9   3    d
10  4    e
11  5    f
12  6    g
13  7    h
dtype: object

`numpy.where`

Not assuming one 1 per row, and dropping the extra index level

i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
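When a row really can contain several 1s, one option is to collect all matching labels per row instead of one label each. This is a sketch on a made-up frame; the names labels and per_row are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-hot frame: row 'z' has two 1s.
df = pd.DataFrame(
    {'a': [1, 0, 1], 'b': [0, 1, 1], 'c': [0, 0, 0]},
    index=['x', 'y', 'z'],
)

# i holds row positions and j column positions of every non-zero cell.
i, j = np.where(df.to_numpy() != 0)

# One entry per non-zero cell, keyed by the original row label ...
labels = pd.Series(df.columns[j], index=df.index[i])

# ... then gathered into one list of labels per row.
per_row = labels.groupby(level=0).agg(list)
```

Rows that contain no 1 at all do not appear in per_row; reindexing against df.index would be needed to bring them back as missing values.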

Answer 5

Converting dat["classification"] to a one-hot encoding and back!

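import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder

# dat is assumed to be an existing DataFrame with a "classification" column
le = LabelEncoder()

dat["labels"]= le.fit_transform(dat["classification"])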

Y= pd.get_dummies(dat["labels"])

tru=[]

for i in range(0, len(Y)):
  tru.append(np.argmax(Y.iloc[i]))

tru= le.inverse_transform(tru)

##Identical check!
(tru==dat["classification"]).value_counts()
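The row-by-row loop above can be replaced with a single argmax over the underlying array. The sketch below swaps in pd.factorize for LabelEncoder so that it needs only pandas and NumPy (that substitution is mine, not part of this answer), and the dat frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dat frame used above.
dat = pd.DataFrame({"classification": ["cat", "dog", "cat", "bird"]})

# factorize plays the role of LabelEncoder here: integer codes plus the
# sorted unique labels needed to map the codes back.
codes, uniques = pd.factorize(dat["classification"], sort=True)

Y = pd.get_dummies(pd.Series(codes), dtype=int)

# One vectorized argmax over all rows instead of a Python loop.
positions = Y.to_numpy().argmax(axis=1)
tru = pd.Series(uniques[positions], index=dat.index)
```
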

Answer 6

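If you are categorizing the rows of your dataframe based on row-wise mutually exclusive boolean conditions (these are the "dummy" variables) that do not form a partition (i.e. some rows are all 0 because, for example, some data is missing), it may be better to initialize a pd.Categorical full of np.nan and then explicitly set the category of each subset. An example follows.

0. Data setup: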

np.random.seed(42)

student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan  # artificially introduce NAs

students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)

>>> students
   mark   pass
a  51.0   True
b   NaN   True
c  14.0  False
d  71.0   True
e  60.0   True
f   NaN  False
g  82.0   True
h  86.0   True
i  74.0   True

1. Compute the values of the relevant boolean conditions:

failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)

>>> pd.DataFrame({'f': failed, 'b': barely_passed, 'p': well_passed}).astype(int)
   b  f  p
a  1  0  0
b  0  0  0
c  0  1  0
d  0  0  1
e  0  0  1
f  0  1  0
g  0  0  1
h  0  0  1
i  0  0  1

As you can see, row b has False for all three categories (since its mark is NaN and pass is True).

2. Generate the categorical series:

cat = pd.Series(
    pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
    index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"

>>> cat
a    barely passed
b              NaN
c           failed
d      well passed
e      well passed
f           failed
g      well passed
h      well passed
i      well passed

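As you can see, a NaN was left where none of the categories applied.

This approach is as performant as using np.where, but allows for the flexibility of possible NaNs.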
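If the conditions are already collected in a dummy frame, the same NaN-preserving result can be sketched with pd.Categorical.from_codes, where a code of -1 stands for a missing value. The small frame below is invented for illustration and assumes at most one 1 per row.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy frame; row 'b' matches none of the categories.
dummies = pd.DataFrame(
    {'failed': [0, 0, 1], 'barely passed': [1, 0, 0], 'well passed': [0, 0, 0]},
    index=list('abc'),
)

# Position of the 1 in each row, or -1 where the whole row is zero.
codes = np.where(
    dummies.any(axis=1), dummies.to_numpy().argmax(axis=1), -1
)

# from_codes treats -1 as NaN, so all-zero rows stay missing.
cat = pd.Series(
    pd.Categorical.from_codes(codes, categories=dummies.columns),
    index=dummies.index,
)
```
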

Reconstruct a categorical variable from dummies in pandas

6 answers:
