Question

df is a data frame containing the following information.

In [61]: df.head()
Out[61]: 
   id  movie_id                  info
0   1         1  Italy:1 January 1994
1   2         2   USA:22 January 2006
2   3         3  USA:12 February 2006
3   4         4     USA:February 2006
4   5         5              USA:2006

I want the output to look like this:

In [61]: df.head()    
Out[61]: 
   id  movie_id    country Date    Month   Year
0   1         1    Italy    1     January  1994
1   2         2    USA      22    January  2006
2   3         3    USA      12    February 2006
3   4         4    USA      None  February 2006
4   5         5    USA      None  None     2006

The data is stored in a data frame, and the result must be written back into that data frame.

Answer 1

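You can split the column on a colon or on whitespace with the regular expression :|\s+, and pass expand=True so that the result is expanded into columns: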

df[["country","Date","Month","Year"]] = df['info'].str.split(r':|\s+', expand = True)
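A small sketch of that split on the sample rows from the question may help show where it falls short. The frame is rebuilt by hand here, and the name `parts` is only illustrative. Because `split` pads short rows on the right, rows with a missing day or month end up with their values shifted into the wrong columns:

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:February 2006",
        "USA:2006",
    ],
})

# Split on a colon or on any run of whitespace; expand=True returns a DataFrame.
parts = df["info"].str.split(r":|\s+", expand=True)
parts.columns = ["country", "Date", "Month", "Year"]

# Complete rows line up as expected, but short rows are padded on the right:
# row 3 gets Date="February", Month="2006" and a missing Year,
# row 4 gets Date="2006" and missing Month and Year.
print(parts)
```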

Update:

To handle a day and month that are optional and may be missing, you can try extract with a regular expression:

df[["country","Date","Month","Year"]] = (
     df['info'].str.extract(r'^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))

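^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$ contains four capture groups, corresponding to country, Date, Month and Year respectively;
^ and $ mark the start and end of the string;
([A-Za-z]+) captures the country before the :, which consists of letters;
(\d{1,2}) captures the day, made of one or two digits, but it is optional (note the ? after the group), i.e. it may be missing;
([A-Za-z]+) captures the month, made of letters, and is likewise marked optional with ?;
(\d{4}) captures the year, made of four digits.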
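As a rough end-to-end check, the pattern can be run against the sample rows from the question. This is a sketch rather than a definitive solution: the frame is rebuilt by hand, the name `pattern` is illustrative, and dropping the original info column is an assumption based on the desired output shown in the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "movie_id": [1, 2, 3, 4, 5],
    "info": [
        "Italy:1 January 1994",
        "USA:22 January 2006",
        "USA:12 February 2006",
        "USA:February 2006",
        "USA:2006",
    ],
})

# Four capture groups: country, optional day, optional month, year.
pattern = r"^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$"

# Each capture group becomes one column; unmatched optional groups become missing values.
df[["country", "Date", "Month", "Year"]] = df["info"].str.extract(pattern, expand=True)

# Assumption: the original column is no longer wanted once it has been split.
df = df.drop(columns="info")

print(df)
```

All extracted values are strings; if numeric columns are needed, something like `pd.to_numeric(df["Year"])` could be applied afterwards.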

Answer 2

Use the split string method.

In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split(r'\W+', expand=True)

In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006
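To see what the `\W+` pattern does to a single value, here is a minimal sketch using the standard library's `re.split`, which treats the pattern the same way for these inputs. The variable names are illustrative. Note that the output above has a day and a month in every row; a value with fewer parts simply yields fewer tokens, so the positional column assignment only lines up when every row is complete.

```python
import re

# \W+ matches any run of non-word characters, so both ':' and ' ' act as separators.
full = re.split(r"\W+", "Italy:1 January 1994")
short = re.split(r"\W+", "USA:2006")

print(full)   # four tokens: country, day, month, year
print(short)  # only two tokens: country and year
```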

How do I split a column's data into other columns stored in the data frame?
