Using regular expressions

Time: 2017-06-28 13:07:37

Tags: python regex pandas dataframe

This is my DataFrame:

index     duration 
1           7 year   
2           2day
3           4 week
4           8 month

I need to separate the number from the time unit and put them in two new columns. The desired output looks like this:

index     duration         number     time
1           7 year          7         year
2           2day            2         day
3           4 week          4         week
4           8 month         8         month

This is my code:

df ['numer'] = df.duration.replace(r'\d.*' , r'\d', regex=True, inplace = True)
df [ 'time']= df.duration.replace (r'\.w.+',r'\w.+', regex=True, inplace = True )

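But it doesn't work. Any suggestions?

I also need to create another column based on the values in the time column, so the new dataset looks like this: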

index     duration         number     time      time_days
1           7 year          7         year       365
2           2day            2         day         1
3           4 week          4         week        7
4           8 month         8         month       30

df['time_day']= df.time.replace(r'(year|month|week|day)', r'(365|30|7|1)', regex=True, inplace=True)

Any suggestions?

2 answers:

Answer 0 (score: 3)

We can use Series.str.extract here:

In [67]: df[['number','time']] = df.duration.str.extract(r'(\d+)\s*(.*)', expand=True)

In [68]: df
Out[68]:
   index duration number    time
0      1   7 year      7    year
1      2     2day      2     day
2      3   4 week      4    week
3      4  8 month      8   month

RegEx explained - regex101.com is, IMO, one of the best online RegEx parsers, testers, and explainers.
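
To make the pattern above a bit more concrete, here is a minimal sketch that applies the same regex with Python's built-in `re` module. The `samples` list is just the question's data typed in by hand, and `pattern.search` is used on the assumption that `str.extract` takes the first match in each string. `(\d+)` captures one or more digits, `\s*` skips any whitespace (including none, as in `2day`), and `(.*)` captures whatever is left.

```python
import re

# Same pattern as in the answer: digits, optional whitespace, then the rest.
pattern = re.compile(r'(\d+)\s*(.*)')

# Sample strings copied by hand from the question's duration column.
samples = ['7 year', '2day', '4 week', '8 month']

pairs = []
for text in samples:
    match = pattern.search(text)
    # group(1) is the number part, group(2) is the unit part.
    pairs.append((match.group(1), match.group(2)))

print(pairs)
```

Note that both groups come back as strings, which is why the integer conversion is a separate step.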

You may also want to convert the number column to an integer dtype:

In [69]: df['number'] = df['number'].astype(int)

In [70]: df.dtypes
Out[70]:
index        int64
duration    object
number       int32
time        object
dtype: object
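
One caveat with `astype(int)`: if a row does not match the pattern, `str.extract` leaves `NaN` there and the conversion fails. A possible alternative, sketched below under that assumption, is `pd.to_numeric` with `errors='coerce'`, which keeps the missing values (the column then ends up as a float dtype). The `'unknown'` row is made up for illustration and is not in the question's data.

```python
import pandas as pd

# Hypothetical data: the last row has no number, so the pattern will not match.
df = pd.DataFrame({'duration': ['7 year', '2day', 'unknown']})

parts = df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)

# to_numeric converts the digit strings and leaves NaN where nothing matched.
df['number'] = pd.to_numeric(parts[0], errors='coerce')
df['time'] = parts[1]

print(df)
```

If every row is known to match, plain `astype(int)` as shown above is simpler.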

UPDATE:

In [167]: df['time_day'] = df['time'].replace(['year','month','week','day'], [365, 30, 7, 1], regex=True)

In [168]: df
Out[168]:
   index duration number    time  time_day
0      1   7 year      7    year       365
1      2     2day      2     day         1
2      3   4 week      4    week         7
3      4  8 month      8   month        30
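
Since the time column holds exact unit names at this point, a dictionary lookup with `Series.map` may be an option too. This is a different technique from the regex `replace` above, not what the answer uses, and `days_per_unit` is a name I made up. The day counts (365, 30, 7, 1) are the approximations from the question. A minimal sketch:

```python
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({'index': [1, 2, 3, 4],
                   'duration': ['7 year', '2day', '4 week', '8 month']})
df[['number', 'time']] = df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)

# Hypothetical lookup table: unit name -> approximate number of days.
days_per_unit = {'year': 365, 'month': 30, 'week': 7, 'day': 1}

# map() looks each unit up in the dict; units missing from the dict become NaN.
df['time_day'] = df['time'].map(days_per_unit)

print(df)
```

One difference to keep in mind: `map` turns any unit missing from the dictionary into `NaN`, while `replace` leaves unmatched values as they are.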

Answer 1 (score: 2)

You can use str.extract together with astype:

# note: this overwrites df, keeping only the two extracted columns
df = df['duration'].str.extract(r'(?P<number>\d+)\s*(?P<time>\w+)', expand=True)
# convert to int
df['number'] = df['number'].astype(int)
print(df)
   number   time
0       7   year
1       2    day
2       4   week
3       8  month
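
The `(?P<name>...)` syntax in the pattern above is a named capture group, and `str.extract` uses those names as the column labels of the result. A small sketch to check this on the question's data (the Series is typed in by hand, and `parts` is just a name picked for illustration):

```python
import pandas as pd

# Sample data typed in from the question.
s = pd.Series(['7 year', '2day', '4 week', '8 month'], name='duration')

# Named groups become the column names of the extracted DataFrame.
parts = s.str.extract(r'(?P<number>\d+)\s*(?P<time>\w+)', expand=True)

print(list(parts.columns))
print(parts)
```

With unnamed groups the columns are simply numbered 0 and 1, so the names have to be supplied on assignment instead.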

See "Extracting substrings" in the pandas documentation.

To add the columns to the original DataFrame:

df = df.join(df['duration'].str.extract(r'(?P<number>\d+)\s*(?P<time>\w+)', expand=True))
# convert to int
df['number'] = df['number'].astype(int)
print(df)
   index duration  number   time
0      1   7 year       7   year
1      2     2day       2    day
2      3   4 week       4   week
3      4  8 month       8  month

Or assign both columns directly, using unnamed groups:

df[['number','time']] = df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
# convert to int
df['number'] = df['number'].astype(int)
print(df)
   index duration  number   time
0      1   7 year       7   year
1      2     2day       2    day
2      3   4 week       4   week
3      4  8 month       8  month