Question

I have this DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']], columns=['value', 'topic'])
print(df)
    value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade

I would like an additional numeric column like this:

    value  topic  topic_id
0   137   earn    0
1   158   earn    0
2   144   ship    1
3   111  trade    2
4   132  trade    2

So basically I want to generate a column that encodes the string column as numeric values. I implemented this solution:

topics_dict = {}
topics = np.unique(df['topic']).tolist()
for i in range(len(topics)):
    topics_dict[topics[i]] = i
df['topic_id'] = [topics_dict[l] for l in df['topic']]

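However, I am sure there is a more elegant, more pandas-idiomatic way to solve this, but I could not find anything on Google or SO. I read about pandas.get_dummies, but that creates multiple columns, one for each distinct value in the original column.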

I appreciate any help or pointers in the right direction!

Answer 1

Option 1
pd.factorize

df['topic_id'] = pd.factorize(df.topic)[0]
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 2
np.unique

_, v = np.unique(df.topic, return_inverse=True)
df['topic_id'] = v

df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 3
pd.Categorical

df['topic_id'] = pd.Categorical(df.topic).codes
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 4
DataFrameGroupBy.ngroup

df['topic_id'] = df.groupby('topic').ngroup()
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
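
One detail worth checking before picking an option: all four give the same IDs for the question's data only because the topics there happen to appear in alphabetical order. The sketch below uses a small made-up frame (`demo` is hypothetical, not the question's `df`) to show that pd.factorize numbers values in order of first appearance by default, while the other three options number them in sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical data: topics do NOT first appear in alphabetical order.
demo = pd.DataFrame({"topic": ["trade", "earn", "ship", "earn"]})

# Option 1: codes follow order of first appearance; the uniques come back too.
codes_seen, uniques_seen = pd.factorize(demo["topic"])

# Passing sort=True makes factorize number the values in sorted order instead.
codes_sorted, uniques_sorted = pd.factorize(demo["topic"], sort=True)

# Options 2-4: codes follow the sorted order of the distinct values.
_, codes_np = np.unique(demo["topic"].to_numpy(), return_inverse=True)
codes_cat = pd.Categorical(demo["topic"]).codes
codes_grp = demo.groupby("topic").ngroup()

print(codes_seen.tolist())    # [0, 1, 2, 1]
print(codes_sorted.tolist())  # [2, 0, 1, 0]
print(codes_np.tolist())      # [2, 0, 1, 0]
```

If the IDs need to be stable across differently ordered inputs, the sorted variants are probably the safer choice; if they should reflect the order in which topics show up, the default pd.factorize does that.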

Answer 2

You can use

In [63]: df['topic'].astype('category').cat.codes
Out[63]:
0    0
1    0
2    1
3    2
4    2
dtype: int8
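
To get the result into the frame, the codes can be assigned to a new column. The sketch below does that on a small made-up frame (`demo`, with one missing value added purely for illustration) and also builds a lookup from ID back to topic. Missing values receive the code -1 with this approach.

```python
import pandas as pd

# Hypothetical data with one missing topic, to show how it is encoded.
demo = pd.DataFrame({"topic": ["earn", "earn", "ship", None, "trade"]})

# Convert once so the codes and the categories come from the same object.
topic_cat = demo["topic"].astype("category")

# Missing values are not a category; their code is -1.
demo["topic_id"] = topic_cat.cat.codes

# Lookup table from code back to the original label.
id_to_topic = dict(enumerate(topic_cat.cat.categories))

print(demo["topic_id"].tolist())  # [0, 0, 1, -1, 2]
print(id_to_topic)                # {0: 'earn', 1: 'ship', 2: 'trade'}
```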

Answer 3

We can use the apply function to create a new column based on an existing column, as shown below.

topic_list = list(df["topic"].unique())
df['topic_id'] = df.apply(lambda row: topic_list.index(row["topic"]), axis=1)
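
The same first-appearance lookup can be sketched with a dictionary and Series.map, which avoids scanning the list with list.index once per row. This is an alternative to the apply call, not part of the original answer, and the names `demo` and `topic_to_id` are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the question's topic column.
demo = pd.DataFrame({"topic": ["earn", "earn", "ship", "trade", "trade"]})

# Series.unique() keeps order of first appearance, so IDs follow that order.
topic_to_id = {topic: i for i, topic in enumerate(demo["topic"].unique())}

# map() looks each topic up in the dictionary.
demo["topic_id"] = demo["topic"].map(topic_to_id)

print(demo["topic_id"].tolist())  # [0, 0, 1, 2, 2]
```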

Answer 4

You can use a for loop and a list comprehension to build the list of codes:

ucols = pd.unique(df.topic)
df['topic_id'] = [ j
                for i in range(len(df.topic))
                for j in range(len(ucols))
                if df.topic[i] == ucols[j]  ]
print(df)

Output:

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Answer 5

Try this code (note that it hard-codes the IDs for this particular DataFrame):

df['topic_id'] = pd.Series([0,0,1,2,2], index=df.index)

It works well:

   value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade
  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Add a numeric column to a pandas DataFrame based on another text column
