How do I split a column's data into other columns stored in a DataFrame?

Time: 2016-10-16 13:54:50

Tags: python pandas

df is a DataFrame containing the following information.

In [61]: df.head()
Out[61]: 
   id  movie_id                  info
0   1         1  Italy:1 January 1994
1   2         2   USA:22 January 2006
2   3         3  USA:12 February 2006
3   4         4     USA:February 2006
4   5         5              USA:2006

I want the output to look like this:

In [61]: df.head()
Out[61]: 
   id  movie_id country  Date     Month  Year
0   1         1   Italy     1   January  1994
1   2         2     USA    22   January  2006
2   3         3     USA    12  February  2006
3   4         4     USA  None  February  2006
4   5         5     USA  None      None  2006

The data is stored in a DataFrame, and the result must be written back into that DataFrame.

2 Answers:

Answer 0: (score: 2)

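You can split the column on a colon or whitespace using the regular expression :|\s+, and set the expand parameter to True so that the result is expanded into columns: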

df[["country", "Date", "Month", "Year"]] = df['info'].str.split(r':|\s+', expand=True)
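
One caveat worth illustrating: str.split fills the resulting columns from left to right, so a row with a missing day or month ends up with its values shifted into the wrong columns. A minimal sketch, assuming a small sample Series (the name `info` and the three rows are borrowed from the question purely for illustration):

```python
import pandas as pd

# Sample values mirroring the question's info column (illustrative only).
info = pd.Series(["Italy:1 January 1994", "USA:February 2006", "USA:2006"])

# Split on a colon or a run of whitespace; expand=True returns a DataFrame
# with one column per piece, numbered from 0.
parts = info.str.split(r":|\s+", expand=True)
print(parts)
```

For the second row the month lands in column 1, where the day is expected, and the last column is left empty, so a plain split only lines up when every row has all four fields.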


Update

To handle a day and month that may be missing, you can try str.extract with a regular expression:

df[["country","Date","Month","Year"]] = (
     df['info'].str.extract(r'^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$', expand=True))
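  • ^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$ contains four capture groups corresponding to country, Date, Month, and Year, respectively;
  • ^ and $ mark the start and end of the string;
  • ([A-Za-z]+) captures the country before the :, which consists of letters;
  • (\d{1,2}) captures the day, made up of one or two digits, but it is optional (the ? after the group), i.e. it may be missing;
  • ([A-Za-z]+) captures the month, which consists of letters, and is likewise marked optional with ?;
  • (\d{4}) captures the year, which consists of four digits.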
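
Putting the pattern above together, here is a minimal, self-contained sketch that rebuilds the question's sample frame and applies the extraction. The final drop of the info column is an assumption based on the desired output shown in the question, not part of the original answer.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:February 2006",
        "USA:2006",
    ],
})

# Four capture groups: country, optional day, optional month, year.
pattern = r"^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$"

# expand=True returns one column per capture group; optional groups
# that did not match come back as missing values.
df[["country", "Date", "Month", "Year"]] = df["info"].str.extract(pattern, expand=True)

# Assumption: drop the source column to match the desired output.
df = df.drop(columns="info")
print(df)
```

The extracted values are strings; rows without a day or month hold missing values in those columns rather than shifted data.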


Answer 1: (score: 1)

Use the split string method.

In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split(r'\W+', expand=True)

In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006