How to split a column based on a string (if it exists) into separate columns

Time: 2018-09-24 18:07:37

Tags: python regex pandas

I want to split the Event column of the data frame below into two columns, Year and Event_Name:

import pandas as pd

df = pd.DataFrame({
    'Event': ['2018 Green Meeting', '2018 Yellow Meeting', '2018 Red Meeting',
              '2017 Green Meeting', '2017 Yellow Meeting', '2017 Red Meeting',
              '2016 Green Meeting', '2016 Yellow Meeting', '2016 Red Meeting',
              'Blue Meeting', 'Purple Meeting', 'Green Meeting'],
    'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

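When I try to do this with a regular expression, it does not work properly: I get the two columns, Year and Event_Name, but Year is empty.

This is the output I am trying to get: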

df2 = pd.DataFrame({
    'Year': ['2018', '2018', '2018',
             '2017', '2017', '2017',
             '2016', '2016', '2016',
             'Blue Meeting', 'Purple Meeting', 'Green Meeting'],
    'Event_Name': ['Green Meeting', 'Yellow Meeting', 'Red Meeting',
                   'Green Meeting', 'Yellow Meeting', 'Red Meeting',
                   'Green Meeting', 'Yellow Meeting', 'Red Meeting',
                   'Blue Meeting', 'Purple Meeting', 'Green Meeting'],
    'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

How can I get this to work?

4 answers:

Answer 0 (score: 7)

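Use str.extract together with fillna. Passing expand=False makes extract return a Series, so fillna can fall back to the original Event value row by row:

df['Year'] = df.Event.str.extract(r'(\d+)', expand=False).fillna(df.Event)

Then use replace to strip the leading year from the event name:

df['Event_Name'] = df.Event.str.replace(r'^\d+\s*', '', regex=True)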
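The two steps above can be tried end to end on a small sample. This is only a sketch: the three-row frame is made up for illustration, and the pattern is narrowed to a four-digit year at the start of the string, which is an assumption about the data rather than part of the original answer.

```python
import pandas as pd

# Hypothetical sample with the same shape as the question's data
df = pd.DataFrame({
    'Event': ['2018 Green Meeting', '2017 Red Meeting', 'Blue Meeting'],
    'Count': [1, 2, 3],
})

# expand=False returns a Series, so fillna(df['Event']) aligns row by row
year = df['Event'].str.extract(r'^(\d{4})\b', expand=False)
df['Year'] = year.fillna(df['Event'])

# Remove a leading four-digit year and the whitespace after it, if present
df['Event_Name'] = df['Event'].str.replace(r'^\d{4}\s+', '', regex=True)

print(df[['Year', 'Event_Name', 'Count']])
```

Rows without a year keep the full event text in both columns, which matches the output the question asks for.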

Answer 1 (score: 4)

Using pandas.Series.str.findall

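# Split each string into alternating runs of digits and non-digits
s = df.Event.str.findall(r'(\d+|\D+)')

pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1].str.strip(),  # drop the space left after the year
    Year=s.str[0]
))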

    Count       Event_Name            Year
0       1    Green Meeting            2018
1       2   Yellow Meeting            2018
2       3      Red Meeting            2018
3       4    Green Meeting            2017
4       5   Yellow Meeting            2017
5       6      Red Meeting            2017
6       7    Green Meeting            2016
7       8   Yellow Meeting            2016
8       9      Red Meeting            2016
9      10     Blue Meeting    Blue Meeting
10     11   Purple Meeting  Purple Meeting
11     12    Green Meeting   Green Meeting

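Without regex

def f(x):
    # Split off the first whitespace-delimited token
    parts = x.split(None, 1)
    if len(parts) == 2 and parts[0].isdecimal():
        return tuple(parts)
    # No leading year: a 1-tuple, so both [0] and [-1] give the full string
    return (x,)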

s = df.Event.apply(f)

pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1],
    Year=s.str[0]
))

    Count       Event_Name            Year
0       1    Green Meeting            2018
1       2   Yellow Meeting            2018
2       3      Red Meeting            2018
3       4    Green Meeting            2017
4       5   Yellow Meeting            2017
5       6      Red Meeting            2017
6       7    Green Meeting            2016
7       8   Yellow Meeting            2016
8       9      Red Meeting            2016
9      10     Blue Meeting    Blue Meeting
10     11   Purple Meeting  Purple Meeting
11     12    Green Meeting   Green Meeting

Answer 2 (score: 3)

Using extractall

df[['Year','Event']] = df.Event.str.extractall(r'(\d{4})? ?(.+$)').reset_index('match', drop=True)

Output:

             Event  Count  Year
0    Green Meeting      1  2018
1   Yellow Meeting      2  2018
2      Red Meeting      3  2018
3    Green Meeting      4  2017
4   Yellow Meeting      5  2017
5      Red Meeting      6  2017
6    Green Meeting      7  2016
7   Yellow Meeting      8  2016
8      Red Meeting      9  2016
9     Blue Meeting     10   NaN
10  Purple Meeting     11   NaN
11   Green Meeting     12   NaN
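Since each row yields exactly one match here, a similar result can probably be had with a single str.extract call and named groups. The sketch below is a variation, not the answerer's code: the sample frame and the group names Year and Event_Name are chosen for illustration.

```python
import pandas as pd

# Hypothetical sample with the same shape as the question's data
df = pd.DataFrame({
    'Event': ['2018 Green Meeting', '2017 Red Meeting', 'Blue Meeting'],
    'Count': [1, 2, 3],
})

# Named groups become column names; the optional non-capturing group
# leaves Year missing when the string has no leading four-digit year
pattern = r'^(?:(?P<Year>\d{4})\s+)?(?P<Event_Name>.+)$'
parts = df['Event'].str.extract(pattern)

# Attach the two new columns next to the original ones
out = df.join(parts)
print(out)
```

Unlike the fillna approach, this leaves Year missing for events without a year, which is usually easier to filter on later.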

Answer 3 (score: 0)

This should do the job:

def get_year(x):
    # Return the leading token as an int, or None if it is not a number
    try:
        return int(x.split()[0])
    except (AttributeError, IndexError, ValueError):
        return None

def get_event_name(x):
    # Drop the leading token only when it parses as a year
    if get_year(x) is None:
        return x
    return ' '.join(x.split()[1:])

df['Year'] = df['Event'].apply(get_year)
df['Event_Name'] = df['Event'].apply(get_event_name)
df = df.drop(['Event'], axis=1)
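One thing to keep in mind with this approach is that mixing int and None in a column normally makes pandas store Year as floats (2018.0). If whole-number years are wanted, a vectorized variant with a nullable integer dtype might look like the sketch below; the sample frame is hypothetical and the choice of 'Int64' is an assumption, not something the answer specifies.

```python
import pandas as pd

# Hypothetical sample with the same shape as the question's data
df = pd.DataFrame({'Event': ['2018 Green Meeting', 'Blue Meeting']})

# First whitespace-delimited token of each event string
first = df['Event'].str.split(n=1).str[0]

# Non-numeric tokens become NaN, then the nullable Int64 dtype keeps
# real years as integers and marks the rest as <NA>
df['Year'] = pd.to_numeric(first, errors='coerce').astype('Int64')

print(df)
```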