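pandas: split a column into multiple columns

Date: 2015-11-14 10:00:51

Tags: python csv pandas

I couldn't find an existing solution to my problem.

In my dataset, I have a column containing weather events. I need to convert it into multiple numeric indicator columns. I'm looking for a quick way to do this.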

import pandas as pd
weather = pd.read_csv("weather.csv", parse_dates=[0])

The Events column looks like this:

id                    Events
0                       Rain
...
1                       Rain
...
8                   Fog-Rain
9                  Rain-Snow

I need to convert it into 4 features:

events = ['Rain','Snow','Fog','Thunderstorm']

Each can take one of two values: 1 or 0.

How can I do this with pandas?

2 answers:

Answer 0: (score: 3)

str.get_dummies handles this very cleanly:

import pandas as pd

events_list = ['Rain', 'Rain', 'Fog-Rain', 'Rain-Snow', 'Thunderstorm', 'Fog-Thunderstorm']

weather_df = pd.DataFrame(events_list, columns=['Events'])

print(weather_df)

Output:

             Events
0              Rain
1              Rain
2          Fog-Rain
3         Rain-Snow
4      Thunderstorm
5  Fog-Thunderstorm

We use str.get_dummies and join the result to the original DataFrame:

weather_df = pd.concat([weather_df, weather_df.Events.str.get_dummies(sep='-')], axis=1)
print(weather_df)

Output:

             Events  Fog  Rain  Snow  Thunderstorm
0              Rain    0     1     0             0
1              Rain    0     1     0             0
2          Fog-Rain    1     1     0             0
3         Rain-Snow    0     1     1             0
4      Thunderstorm    0     0     0             1
5  Fog-Thunderstorm    1     0     0             1

You can easily drop the original column if you like.
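As a minimal sketch of that last step, the snippet below drops the original column and also pins the result to the four event names from the question. The `events` list comes from the question; the `reindex` call is an addition here, on the assumption that you want all four columns present even when an event never occurs in the data.

```python
import pandas as pd

weather_df = pd.DataFrame(
    {"Events": ["Rain", "Rain", "Fog-Rain", "Rain-Snow", "Thunderstorm", "Fog-Thunderstorm"]}
)

# The four event names from the question; listing them explicitly keeps the
# column set and order stable even if one event is absent from the data.
events = ["Rain", "Snow", "Fog", "Thunderstorm"]

indicators = (
    weather_df["Events"]
    .str.get_dummies(sep="-")                # one 0/1 column per event found
    .reindex(columns=events, fill_value=0)   # add missing events as all-zero columns
)

# Join the indicators and drop the original text column.
result = weather_df.join(indicators).drop(columns="Events")
print(result)
```

Without the `reindex`, the columns are whatever event names happen to appear in the data, in sorted order.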

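Answer 1: (score: 1)

Because an Events value can combine several event names, plain get_dummies (pd.get_dummies applied to the column as-is) will create a column for every combination that occurs. Use str.contains() to find matches and create one column per event.

I use 0 for true and -1 for false, but you can swap in other values.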
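To make the difference concrete, here is a small sketch comparing the two calls on a made-up sample (the `events_col` series is illustrative, not the asker's data). It shows that `pd.get_dummies` produces a column per distinct string, while `Series.str.get_dummies` with `sep='-'` splits first and produces a column per event.

```python
import pandas as pd

# Illustrative sample, not the asker's data.
events_col = pd.Series(["Rain", "Fog-Rain", "Rain-Snow", "Fog"], name="Events")

# pd.get_dummies treats each distinct string as one category, so combined
# values such as "Fog-Rain" get their own column.
per_value = pd.get_dummies(events_col)
print(list(per_value.columns))

# Series.str.get_dummies splits each string on the separator first, so the
# columns are the individual event names.
per_event = events_col.str.get_dummies(sep="-")
print(list(per_event.columns))
```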

df
Out[48]: 
   id        Events
0   0          Rain
1   1          Rain
2   8      Fog-Rain
3   9     Rain-Snow
4  32  Thunderstorm
5  31           Fog
6  23          Snow

df.Events.str.contains("Rain")
Out[49]: 
0     True
1     True
2     True
3     True
4    False
5    False
6    False
Name: Events, dtype: bool

df.loc[df.Events.str.contains("Rain"), "Rain"] = 0

df
Out[51]: 
   id        Events  Rain
0   0          Rain     0
1   1          Rain     0
2   8      Fog-Rain     0
3   9     Rain-Snow     0
4  32  Thunderstorm   NaN
5  31           Fog   NaN
6  23          Snow   NaN

df.loc[df.Events.str.contains("Snow"), "Snow"] = 0

df
Out[53]: 
   id        Events  Rain  Snow
0   0          Rain     0   NaN
1   1          Rain     0   NaN
2   8      Fog-Rain     0   NaN
3   9     Rain-Snow     0     0
4  32  Thunderstorm   NaN   NaN
5  31           Fog   NaN   NaN
6  23          Snow   NaN     0

df.loc[df.Events.str.contains("Thunderstorm"), "Thunderstorm"] = 0

df
Out[55]: 
   id        Events  Rain  Snow  Thunderstorm
0   0          Rain     0   NaN           NaN
1   1          Rain     0   NaN           NaN
2   8      Fog-Rain     0   NaN           NaN
3   9     Rain-Snow     0     0           NaN
4  32  Thunderstorm   NaN   NaN             0
5  31           Fog   NaN   NaN           NaN
6  23          Snow   NaN     0           NaN

df.loc[df.Events.str.contains("Fog"), "Fog"] = 0

df
Out[57]: 
   id        Events  Rain  Snow  Thunderstorm  Fog
0   0          Rain     0   NaN           NaN  NaN
1   1          Rain     0   NaN           NaN  NaN
2   8      Fog-Rain     0   NaN           NaN    0
3   9     Rain-Snow     0     0           NaN  NaN
4  32  Thunderstorm   NaN   NaN             0  NaN
5  31           Fog   NaN   NaN           NaN    0
6  23          Snow   NaN     0           NaN  NaN

df = df.fillna(-1)

df
Out[59]: 
   id        Events  Rain  Snow  Thunderstorm  Fog
0   0          Rain     0    -1            -1   -1
1   1          Rain     0    -1            -1   -1
2   8      Fog-Rain     0    -1            -1    0
3   9     Rain-Snow     0     0            -1   -1
4  32  Thunderstorm    -1    -1             0   -1
5  31           Fog    -1    -1            -1    0
6  23          Snow    -1     0            -1   -1
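The same str.contains() idea can be written as a loop that yields the 1/0 values the question asked for, without the intermediate NaN and fillna step. This is a sketch under a few assumptions: the `events` list is taken from the question, and `regex=False` and `na=False` are choices made here so that names are matched literally and missing Events values count as no match.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 8, 9, 32, 31, 23],
    "Events": ["Rain", "Rain", "Fog-Rain", "Rain-Snow", "Thunderstorm", "Fog", "Snow"],
})

events = ["Rain", "Snow", "Fog", "Thunderstorm"]

for event in events:
    # regex=False matches the name literally; na=False treats missing
    # Events values as "no match" instead of propagating NaN.
    df[event] = df["Events"].str.contains(event, regex=False, na=False).astype(int)

print(df)
```

Because this is a substring match, it is only safe when no event name is contained in another; that holds for the four names here.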