I have a DataFrame like the one below, and I want to split its Event column into two columns, Year and Event_Name:
df = pd.DataFrame({
'Event':['2018 Green Meeting','2018 Yellow Meeting','2018 Red Meeting',
'2017 Green Meeting','2017 Yellow Meeting','2017 Red Meeting',
'2016 Green Meeting','2016 Yellow Meeting','2016 Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})
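When I try to do this with a regular expression, it does not work as expected: I get the two columns, Year and Event_Name, but Year is empty.
This is the output I am after: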
df2 = pd.DataFrame({
'Year':['2018','2018','2018',
'2017','2017','2017',
'2016','2016','2016',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Event_Name':['Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})
How can I get this to work?
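Answer 0 (score: 7)
Use str.extract together with fillna:
# expand=False returns a Series, so fillna aligns with df.Event row by row
df['Year'] = df.Event.str.extract(r'(\d+)', expand=False).fillna(df.Event)
Then use replace to drop the digits from the event name:
# regex=True is needed on pandas 2.x; strip() removes the leftover leading space
df['Event_Name'] = df.Event.str.replace(r'\d+', '', regex=True).str.strip()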
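A self-contained sketch of this approach, run on a made-up three-row frame (the sample data is an assumption for illustration, not the asker's full frame). It assumes that rows without a leading year should fall back to the full event text, as in the question's desired output.

```python
import pandas as pd

# Small sample frame, made up for illustration.
df = pd.DataFrame({
    'Event': ['2018 Green Meeting', '2017 Red Meeting', 'Blue Meeting'],
    'Count': [1, 2, 3],
})

# expand=False makes str.extract return a Series for a single group,
# so fillna can align with df['Event'] row by row.
year = df['Event'].str.extract(r'(\d+)', expand=False)
df['Year'] = year.fillna(df['Event'])

# regex=True asks str.replace to treat the pattern as a regular expression;
# the pattern removes a leading run of digits plus any whitespace after it.
df['Event_Name'] = df['Event'].str.replace(r'^\d+\s*', '', regex=True)

print(df[['Year', 'Event_Name', 'Count']])
```

Anchoring the pattern with ^ is a design choice of this sketch: it leaves digits that appear later in an event name untouched.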
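Answer 1 (score: 4)
Using pandas.Series.str.findall:
# Split each string into alternating runs of digits and non-digits.
s = df.Event.str.findall(r'(\d+|\D+)')
pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1].str.strip(),  # last piece, without the leading space
    Year=s.str[0]                      # first piece (the whole string if there is no year)
))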
    Count      Event_Name            Year
0       1   Green Meeting            2018
1       2  Yellow Meeting            2018
2       3     Red Meeting            2018
3       4   Green Meeting            2017
4       5  Yellow Meeting            2017
5       6     Red Meeting            2017
6       7   Green Meeting            2016
7       8  Yellow Meeting            2016
8       9     Red Meeting            2016
9      10    Blue Meeting    Blue Meeting
10     11  Purple Meeting  Purple Meeting
11     12   Green Meeting   Green Meeting
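Alternatively, using a custom function with apply:
def f(x):
    # Split off the first whitespace-delimited token.
    a, b = x.split(None, 1)
    if a.isdecimal():
        return a, b  # (year, event name)
    return (x,)      # no leading year: keep the whole string
s = df.Event.apply(f)
pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1],
    Year=s.str[0]
))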
    Count      Event_Name            Year
0       1   Green Meeting            2018
1       2  Yellow Meeting            2018
2       3     Red Meeting            2018
3       4   Green Meeting            2017
4       5  Yellow Meeting            2017
5       6     Red Meeting            2017
6       7   Green Meeting            2016
7       8  Yellow Meeting            2016
8       9     Red Meeting            2016
9      10    Blue Meeting    Blue Meeting
10     11  Purple Meeting  Purple Meeting
11     12   Green Meeting   Green Meeting
Answer 2 (score: 3)
Use extractall:
# Optional four-digit year, optional space, then the rest of the string.
df[['Year','Event']] = df.Event.str.extractall(r'(\d{4})? ?(.+$)').reset_index('match', drop=True)
Output:
             Event  Count  Year
0    Green Meeting      1  2018
1   Yellow Meeting      2  2018
2      Red Meeting      3  2018
3    Green Meeting      4  2017
4   Yellow Meeting      5  2017
5      Red Meeting      6  2017
6    Green Meeting      7  2016
7   Yellow Meeting      8  2016
8      Red Meeting      9  2016
9     Blue Meeting     10   NaN
10  Purple Meeting     11   NaN
11   Green Meeting     12   NaN
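Since each row yields at most one match here, str.extract with named groups may be a simpler alternative to extractall. The following is a minimal sketch rather than the answerer's code: the three-row frame and the names parts, result and Year_filled are made up for illustration, and the fillna step is only an assumption about how to reproduce the fallback shown in the question's desired output.

```python
import pandas as pd

# Small sample frame, made up for illustration.
df = pd.DataFrame({
    'Event': ['2018 Green Meeting', '2016 Red Meeting', 'Blue Meeting'],
    'Count': [1, 2, 3],
})

# Named groups become the column names of the returned DataFrame.
# The year group is optional, so rows without a year get NaN there.
parts = df['Event'].str.extract(r'^(?P<Year>\d{4})?\s*(?P<Event_Name>.+)$')

# Optional: fall back to the event name where no year was found.
parts['Year_filled'] = parts['Year'].fillna(parts['Event_Name'])

# Put the new columns next to the untouched Count column.
result = pd.concat([parts, df['Count']], axis=1)
print(result)
```

Keeping Year and Year_filled as separate columns makes it easy to tell which rows really had a year and which were filled in.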
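Answer 3 (score: 0)
This should do the job:
def get_year(x):
    # Return the leading token as an int, or None if it is not a number.
    try:
        return int(x.split()[0])
    except (ValueError, IndexError):
        return None
def get_event_name(x):
    # Drop the leading year if there is one; otherwise return x unchanged.
    try:
        int(x.split()[0])
    except (ValueError, IndexError):
        return x
    return ' '.join(x.split()[1:])
df['Year'] = df['Event'].apply(get_year)
df['Event_Name'] = df['Event'].apply(get_event_name)
df = df.drop(['Event'], axis=1)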