Question

I want to extract the years from the DataFrame column data3['CopyRight'].

CopyRight
2015 Sony Music Entertainment
2015 Ultra Records , LLC under exclusive license
2014 , 2015 Epic Records , a division of Sony Music Entertainment
Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment

I am using the following code to extract the year:

data3['CopyRight_year'] = data3['CopyRight'].str.extract('([0-9]+)', expand=False).str.strip()

With my code, I only get the first occurrence of a year in each row.

CopyRight_year
2015
2015
2014
2014
2014
2014

I want to extract all the years mentioned in the column.

Expected output

CopyRight_year
    2015
    2015
    2014,2015
    2014
    2014,2015
    2014,2015

Answer 1

Your current regex captures only a single run of digits. To capture comma-separated years as well, extend it to:

[0-9]+(?:\s+,\s+[0-9]+)*

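In this regex, [0-9]+ matches a number. The group (?:\s+,\s+[0-9]+)* then matches one or more whitespace characters, a comma, one or more whitespace characters, and another number; the whole group repeats zero or more times, as often as it occurs in the data.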
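
To see what this pattern matches outside of pandas, here is a minimal sketch using Python's standard re module on sample strings from the question. The helper name first_year_run is made up for illustration and is not part of the original answer.

```python
import re

# Pattern from the answer: a number, optionally followed by " , number" groups
YEAR_RUN = re.compile(r'[0-9]+(?:\s+,\s+[0-9]+)*')

def first_year_run(text):
    """Return the first comma-separated run of numbers, whitespace removed, or None."""
    match = YEAR_RUN.search(text)
    if match is None:
        return None
    # Drop the whitespace around the commas, e.g. "2014 , 2015" -> "2014,2015"
    return re.sub(r'\s+', '', match.group(0))

samples = [
    "2015 Sony Music Entertainment",
    "2014 , 2015 Epic Records , a division of Sony Music Entertainment",
    "Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment",
]
for s in samples:
    print(first_year_run(s))
```

Because \s+ requires at least one whitespace character on each side of the comma, a string written as "2014, 2015" would yield only 2014 with this pattern. Using \s* instead would relax that, assuming such rows can occur in your data.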


Change your pandas DataFrame line to this:

data3['CopyRight_year'] = data3['CopyRight'].str.extract(r'([0-9]+(?:\s+,\s+[0-9]+)*)', expand=False).str.replace(r'\s+', '', regex=True)

This prints:

                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a 1999 division of ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015

That said, I like jezrael's answer, which uses findall and join; it gives you more flexibility and a cleaner approach.

Answer 2

Use findall with a regex to collect all 4-digit integers in each row into a list, then join each list with a separator:

Thanks to @WiktorStribiżew for the idea of adding word boundaries, r'\b\d{4}\b':

data3['CopyRight_year'] = data3['CopyRight'].str.findall(r'\b\d{4}\b').str.join(',')
print (data3)
                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
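
The effect of the word boundaries is easiest to see on plain strings. The sketch below uses the standard re module rather than pandas; the function name years_in and the sample string containing a catalogue number are invented for illustration.

```python
import re

def years_in(text, pattern=r'\b\d{4}\b'):
    """Join every match of `pattern` in `text` with commas."""
    return ','.join(re.findall(pattern, text))

# Row from the question: both years are found
print(years_in("2014 , 2015 Epic Records"))

# With word boundaries, the 6-digit number is skipped entirely
print(years_in("Catalogue 123456 , 2015 Ultra"))

# Without them, the first four digits of 123456 are picked up as well
print(years_in("Catalogue 123456 , 2015 Ultra", r'\d{4}'))
```

Note that \b\d{4}\b accepts any standalone 4-digit number, not only plausible years; a stricter pattern would be needed if the column can contain other 4-digit values.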
