Question

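The following situation comes up often in my data analysis. Suppose I have two data vectors x and y from some observations. x has more data points and therefore contains some values that are not observed in y. Now I want to turn them into categorical variables.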

x=['a','b','c','d','e']  #data points
y =['a','c','e']         #data of the same nature as x but with fewer data points  

fx = pandas.Categorical.from_array(x)
fy = pandas.Categorical.from_array(y)

print fx.index
print fy.index

Categorical: 
array([a, b, c, d, e], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)
Categorical: 
array([a, c, e], dtype=object)
Levels (3): Index([a, c, e], dtype=object)

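I see that they now have different levels, so the labels mean different things (a label of 1 means b in fx but c in fy).

This obviously makes it hard to write code that uses both fx and fy, because such code expects fx.labels and fy.labels to have the same encoding/meaning.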

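But I don't see how to "normalize" fx and fy so that they have the same levels and fx.labels and fy.labels have the same encoding. fy.levels = fx.levels obviously does not work. As shown below, it changes the meaning of the labels: [a c e] becomes [a b c].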

fy.levels = fx.levels
print fy

Categorical: 
array([a, b, c], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)

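Does anyone have any ideas?

Another related scenario is that I have a known index and want to factorize data into that index. For example, I know that every data point must take one of the five values [a, b, c, d, e], and I already have an index Index([a, b, c, d, e], dtype=object). I want to factorize the vector y = ['a', 'c', 'e'] into a categorical variable that has Index([a, b, c, d, e], dtype=object) as its levels. I could not figure out how to do this and hope someone can offer some clues.

P.S. Doing these things in R is possible but cumbersome.

Thanks, Tom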
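
For reference, the second scenario in the question (factorizing against a known set of values) can be written directly in current pandas by passing the known values to the constructor. This is a minimal sketch, assuming a pandas version in which pd.Categorical accepts a categories argument and exposes categories/codes rather than the older levels/labels; the variable name known is made up for illustration.

```python
import pandas as pd

# Hypothetical name: the full set of values each data point may take.
known = ['a', 'b', 'c', 'd', 'e']
y = ['a', 'c', 'e']

# Encode y against the known categories instead of the values seen in y.
fy = pd.Categorical(y, categories=known)

print(list(fy.categories))  # ['a', 'b', 'c', 'd', 'e']
print(list(fy.codes))       # [0, 2, 4]
```

Any value of y that is not in known would come out as a missing value rather than raising an error.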

Answer 1

In [6]: fxd = {fx.levels[i]: i for i in range(len(fx.levels))}

In [7]: fy.labels = [fxd[v] for v in fy]

In [8]: fy.levels = fx.levels

In [9]: fy
Out[9]: 
Categorical: 
array([a, c, e], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)
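
The levels and labels attributes used above come from an old pandas API; later releases call them categories and codes, and codes cannot be assigned to directly. The same mapping idea might look roughly like this in a newer pandas. It is a sketch, not the answerer's code, and the names code_of and fy_aligned are made up for illustration.

```python
import pandas as pd

x = ['a', 'b', 'c', 'd', 'e']
y = ['a', 'c', 'e']

fx = pd.Categorical(x)
fy = pd.Categorical(y)

# Map every category of fx to its integer code.
code_of = {cat: i for i, cat in enumerate(fx.categories)}

# Re-encode the values of fy using fx's mapping, then rebuild the
# Categorical from those codes and fx's categories.
new_codes = [code_of[v] for v in fy]
fy_aligned = pd.Categorical.from_codes(new_codes, categories=fx.categories)

print(list(fy_aligned))        # ['a', 'c', 'e']
print(list(fy_aligned.codes))  # [0, 2, 4]
```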

Answer 2

The get_indexer() method can be used to create the index arrays:

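import pandas as pd

x = ['a', 'b', 'c', 'd', 'e']  # data points
y = ['a', 'c', 'e']            # data of the same nature as x but with fewer data points
idx = pd.Index(x + y).unique()  # all distinct values, in order of first appearance
# from_codes is the current way to build a Categorical from integer codes;
# older pandas accepted pd.Categorical(codes, idx) directly.
cx = pd.Categorical.from_codes(idx.get_indexer(x), categories=idx)
cy = pd.Categorical.from_codes(idx.get_indexer(y), categories=idx)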
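
One property of get_indexer() is worth knowing when factorizing against a fixed, known index: values that are absent from the index map to -1, and Categorical.from_codes treats -1 as a missing value. A minimal sketch, where the value 'z' is made up for illustration:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c', 'd', 'e'])  # the known index
y = ['a', 'c', 'z']                        # 'z' is not in the index

codes = idx.get_indexer(y)                 # positions in idx, -1 if absent
cy = pd.Categorical.from_codes(codes, categories=idx)

print(codes.tolist())      # [0, 2, -1]
print(cy.isna().tolist())  # [False, False, True]
```

If unknown values should be an error instead, checking for -1 in codes before building the Categorical is one option.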

Answer 3

Regarding Garrett's answer: in my pandas version (0.20.3), fx.levels raises an AttributeError: 'Categorical' object has no attribute 'levels'. What does work is:

missing_levels = set(fx) - set(fy)
fy = fy.add_categories(missing_levels)

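Or with inplace=True (slightly faster; note that the inplace argument was removed from add_categories in pandas 2.0, so this form only works on older versions):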

missing_levels = set(fx) - set(fy)
fy.add_categories(missing_levels, inplace=True)
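
One caveat: add_categories appends the new categories after the existing ones, so fy ends up with the same set of categories as fx but not necessarily in the same order, and the integer codes can still differ. If identical codes are the goal, set_categories with fx's categories may be closer to what the question asks for. A sketch, assuming a pandas version that provides set_categories; the names fy_added and fy_set are made up for illustration.

```python
import pandas as pd

fx = pd.Categorical(['a', 'b', 'c', 'd', 'e'])
fy = pd.Categorical(['a', 'c', 'e'])

# add_categories: missing categories are appended at the end.
missing = sorted(set(fx.categories) - set(fy.categories))
fy_added = fy.add_categories(missing)
print(list(fy_added.categories))  # ['a', 'c', 'e', 'b', 'd']
print(list(fy_added.codes))       # [0, 1, 2]

# set_categories: adopt fx's categories and order, re-encoding the values.
fy_set = fy.set_categories(fx.categories)
print(list(fy_set.categories))    # ['a', 'b', 'c', 'd', 'e']
print(list(fy_set.codes))         # [0, 2, 4]
```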

Change the levels of a categorical variable to ones I supply / combine the levels of two categorical variables
