df is a DataFrame containing the following information.
In [61]: df.head()
Out[61]:
   id  movie_id                  info
0   1         1  Italy:1 January 1994
1   2         2   USA:22 January 2006
2   3         3  USA:12 February 2006
3   4         4     USA:February 2006
4   5         5              USA:2006
I want the output to look like this:
In [61]: df.head()
Out[61]:
   id  movie_id country  Date     Month  Year
0   1         1   Italy     1   January  1994
1   2         2     USA    22   January  2006
2   3         3     USA    12  February  2006
3   4         4     USA  None  February  2006
4   5         5     USA  None      None  2006
The data is stored in a DataFrame, and the result must be written back into that same DataFrame.
Answer 0 (score: 2)
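You can use the regular expression :|\s+ to split the column on a colon or on whitespace, and set the expand parameter to True so that the result is expanded into columns: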
df[["country","Date","Month","Year"]] = df['info'].str.split(r':|\s+', expand=True)
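As a rough, self-contained sketch of what this split does, the snippet below rebuilds the five sample rows from the question and prints the split result. It assumes pandas is installed, and the name parts is introduced here for illustration only. Because the pieces are filled in from the left, rows that lack a day or a month end up with values in the wrong columns, which is the problem the update addresses.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:February 2006",
        "USA:2006",
    ],
})

# Split on a colon or a run of whitespace; expand=True returns a DataFrame.
# `parts` is an illustrative name, not part of the original answer.
parts = df["info"].str.split(r":|\s+", expand=True)
print(parts)

# Complete rows line up as country / day / month / year.
print(parts.iloc[0].tolist())

# Shorter rows are filled from the left and padded on the right,
# so "USA:2006" places the year in the second column.
print(parts.iloc[4].tolist())
```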
Update:
To handle a day and month that are optional and may be missing, you can try str.extract with a regular expression:
df[["country","Date","Month","Year"]] = df['info'].str.extract(
    r'^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$')
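A minimal runnable sketch of the extract approach on the question's sample rows, assuming pandas is installed. The pattern variable is introduced here only to keep the line short. Optional groups that do not take part in a match come back as missing values.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:February 2006",
        "USA:2006",
    ],
})

# Same regular expression as in the answer; `pattern` is an illustrative name.
pattern = r"^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$"

# str.extract returns one column per capture group.
df[["country", "Date", "Month", "Year"]] = df["info"].str.extract(pattern)

# Show the result without the original info column.
print(df.drop(columns="info"))
```

The extracted values are strings. If numeric columns are needed, something like pd.to_numeric(df["Year"]) could be applied afterwards.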
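The pattern ^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$ contains four capture groups, corresponding to country, Date, Month and Year respectively:
^ and $ match the start and end of the string;
([A-Za-z]+) captures the country before the :, made up of letters;
(\d{1,2}) captures the day, made up of one or two digits; the ? after the group makes it optional, so it may be missing;
([A-Za-z]+) captures the month, made up of letters, and is likewise made optional by the ? that follows it;
(\d{4}) captures the year, made up of four digits.
Answer 1 (score: 1)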
Use the split string method.
In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split(r'\W+', expand=True)
In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006
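A small self-contained sketch of this approach, assuming pandas is installed and using the rows shown in Out[164]. \W+ matches any run of non-word characters, so the colon and the spaces all act as separators. Dropping the info column at the end is an addition made here to match the layout the question asks for. It is not part of the original answer.

```python
import pandas as pd

# Rows as they appear in the answer's output.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:19 February 2006",
        "USA:22 January 2006",
    ],
})

# Split on runs of non-word characters and spread the pieces over new columns.
df[["country", "date", "month", "year"]] = df["info"].str.split(r"\W+", expand=True)

# Assumption: the original info column is no longer needed once it is split.
df = df.drop(columns="info")
print(df)
```

As with any positional split, the columns line up only when every row contains all four parts.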