I have a dataset that looks like this:
yyyy month tmax tmin
0 1908 January 5.0 -1.4
1 1908 February 7.3 1.9
2 1908 March 6.2 0.3
3 1908 April 7.4 2.1
4 1908 May 16.5 7.7
5 1908 June 17.7 8.7
6 1908 July 20.1 11.0
7 1908 August 17.5 9.7
8 1908 September 16.3 8.4
9 1908 October 14.6 8.0
10 1908 November 9.6 3.4
11 1908 December 5.8 -0.3
12 1909 January 5.0 0.1
13 1909 February 5.5 -0.3
14 1909 March 5.6 -0.3
15 1909 April 12.2 3.3
16 1909 May 14.7 4.8
17 1909 June 15.0 7.5
18 1909 July 17.3 10.8
19 1909 August 18.8 10.7
20 1909 September 14.5 8.1
21 1909 October 12.9 6.9
22 1909 November 7.5 1.7
23 1909 December 5.3 0.4
24 1910 January 5.2 -0.5
...
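It has four variables: yyyy, month, tmax (maximum temperature) and tmin.
I want to use the month column as a variable when forecasting, so I would like to convert it to a binary-encoded version. Basically, I want to add 12 columns to the dataset, named January through December. If a given row has the month "January", the January column should be set to 1 and the remaining 11 new columns should be 0.
I looked at pivot tables, but that did not help in my case. Any ideas on how to do this in a simple, elegant way?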
Answer 0 (score: 6)
I think you need get_dummies:
df1 = pd.get_dummies(df['month'])
If you need to add the new columns to the original DataFrame and remove the month column, use join with pop:
df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
yyyy tmax tmin April August December February January July June \
0 1908 5.0 -1.4 0 0 0 0 1 0 0
1 1908 7.3 1.9 0 0 0 1 0 0 0
2 1908 6.2 0.3 0 0 0 0 0 0 0
3 1908 7.4 2.1 1 0 0 0 0 0 0
4 1908 16.5 7.7 0 0 0 0 0 0 0
March May November October September
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
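The join-with-pop one-liner relies on pop removing the column from df in place and returning it. A minimal sketch of that side effect, assuming a small made-up sample frame in the question's layout (the three rows here are only for illustration):

```python
import pandas as pd

# Small hypothetical sample mirroring the question's layout.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# pop() removes 'month' from df in place and returns it as a Series,
# so the join below sees df without the 'month' column.
month = df.pop("month")
df2 = df.join(pd.get_dummies(month))

print(list(df.columns))   # 'month' is no longer part of df
print(list(df2.columns))  # original columns followed by the dummy columns
```

Because pop mutates df, running the one-liner a second time on the same frame raises a KeyError. Work on a copy if the original frame is still needed.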
If you do not need to remove the month column:
df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
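One version-related caveat on the outputs shown here: since pandas 2.0, get_dummies returns boolean columns by default, so the same code prints True/False rather than 1/0. A small sketch, assuming the 0/1 form from the question is what you want, passes dtype=int explicitly:

```python
import pandas as pd

# Hypothetical three-row sample of the 'month' column.
months = pd.Series(["January", "February", "March"], name="month")

# The default dtype depends on the pandas version (bool in pandas 2.0+).
default_dummies = pd.get_dummies(months)

# Requesting int explicitly gives 0/1 columns regardless of version.
int_dummies = pd.get_dummies(months, dtype=int)

print(default_dummies.dtypes)
print(int_dummies)
```

The same dtype=int argument can be added to any of the get_dummies calls in this answer.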
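If you need the columns in calendar order, there are several possible solutions. One is reindex (the older reindex_axis was deprecated in pandas 0.21 and has since been removed):
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(months, axis=1)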
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Or, equivalently, pass the labels through the columns keyword:
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
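A caveat with the reindex approach: if some month never occurs in the data, reindex adds that column filled with NaN. The sketch below assumes a hypothetical partial series with only three months and uses the fill_value parameter of reindex to get zeros instead:

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Hypothetical partial data: only three months are present.
partial = pd.Series(["January", "February", "March"], name="month")

# Without fill_value, the nine missing month columns are all NaN.
with_nan = pd.get_dummies(partial, dtype=int).reindex(columns=months)

# fill_value=0 creates the missing columns as zeros instead.
filled = pd.get_dummies(partial, dtype=int).reindex(columns=months, fill_value=0)

print(with_nan["December"].isna().all())
print(filled["December"].tolist())
```

With the full dataset from the question every month is present, so this only matters when encoding a subset such as a single season or a short test split.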
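Or convert the month column to an ordered categorical. Passing categories and ordered directly to astype('category', ...) no longer works in current pandas, so build a CategoricalDtype instead:
df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))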
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
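A side benefit of the categorical approach is that get_dummies emits one column per declared category, including categories that never occur in the data. A minimal sketch, assuming a hypothetical partial series and using pd.CategoricalDtype to declare the categories:

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Hypothetical partial data, deliberately out of calendar order.
partial = pd.Series(["March", "January", "February"], name="month")

# Declare all twelve months as ordered categories.
month_dtype = pd.CategoricalDtype(categories=months, ordered=True)

# Columns follow the category order, and unused months become all-zero columns.
dummies = pd.get_dummies(partial.astype(month_dtype), dtype=int)

print(list(dummies.columns) == months)
print(dummies.shape)
```

This keeps the column layout stable between training and prediction data even when one of them lacks some months.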
Answer 1 (score: 3)
IIUC, you can use assign, the ** unpacking operator and pd.get_dummies:
df.assign(**pd.get_dummies(df['month']))
Output:
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
5 1908 June 17.7 8.7 0 0 0 0 0
6 1908 July 20.1 11.0 0 0 0 0 0
7 1908 August 17.5 9.7 0 1 0 0 0
8 1908 September 16.3 8.4 0 0 0 0 0
9 1908 October 14.6 8.0 0 0 0 0 0
10 1908 November 9.6 3.4 0 0 0 0 0
11 1908 December 5.8 -0.3 0 0 1 0 0
12 1909 January 5.0 0.1 0 0 0 0 1
13 1909 February 5.5 -0.3 0 0 0 1 0
14 1909 March 5.6 -0.3 0 0 0 0 0
15 1909 April 12.2 3.3 1 0 0 0 0
16 1909 May 14.7 4.8 0 0 0 0 0
17 1909 June 15.0 7.5 0 0 0 0 0
18 1909 July 17.3 10.8 0 0 0 0 0
19 1909 August 18.8 10.7 0 1 0 0 0
20 1909 September 14.5 8.1 0 0 0 0 0
21 1909 October 12.9 6.9 0 0 0 0 0
22 1909 November 7.5 1.7 0 0 0 0 0
23 1909 December 5.3 0.4 0 0 1 0 0
24 1910 January 5.2 -0.5 0 0 0 0 1
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
5 0 1 0 0 0 0 0
6 1 0 0 0 0 0 0
7 0 0 0 0 0 0 0
8 0 0 0 0 0 0 1
9 0 0 0 0 0 1 0
10 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0
14 0 0 1 0 0 0 0
15 0 0 0 0 0 0 0
16 0 0 0 1 0 0 0
17 0 1 0 0 0 0 0
18 1 0 0 0 0 0 0
19 0 0 0 0 0 0 0
20 0 0 0 0 0 0 1
21 0 0 0 0 0 1 0
22 0 0 0 0 1 0 0
23 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0
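The ** unpacking works because a DataFrame behaves like a mapping from column name to column, so each dummy column is passed to assign as a keyword argument. The sketch below, on a made-up two-row frame, compares it with pd.concat along the columns, which is a different technique that builds the same result here:

```python
import pandas as pd

# Hypothetical two-row sample in the question's layout.
df = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
    "tmin": [-1.4, 1.9],
})
dummies = pd.get_dummies(df["month"], dtype=int)

# ** unpacks the dummy frame into keyword arguments, one per column,
# i.e. assign(February=dummies["February"], January=dummies["January"]).
via_assign = df.assign(**dummies)

# concat along the column axis builds the same frame without unpacking.
via_concat = pd.concat([df, dummies], axis=1)

print(via_assign.equals(via_concat))
```

The two differ when a dummy column shares a name with an existing column: assign overwrites the existing column, while concat keeps both under a duplicated label. The unpacking form also requires every column label to be a string.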