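I have been looking for an answer to the following problem but have not found one yet. I have a large dataset that looks like this small example: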
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
I want to extract all values greater than 1950 and get this as the final result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried extracting the values first, but I did not get any further than this:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I only started with Python and working with text a few weeks ago). Can someone help me?
Answer 0: (score: 3)
Here is one approach that uses str.findall and joins those items in the resulting list that are greater than 1950:
s = df["B"].str.findall(r'\d+')  # every run of digits in each row, as a list of strings
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
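For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same findall-and-filter idea. The DataFrame is rebuilt from the sample in the question; the helper name `join_numbers_above` and its `threshold` and `sep` parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

def join_numbers_above(text_col, threshold=1950, sep="_"):
    """Hypothetical helper: join every number above `threshold` found in each string."""
    # One list of digit strings per row, e.g. ['3', '2013'].
    numbers = text_col.str.findall(r"\d+")
    # Keep only the numbers above the threshold and glue them together.
    return numbers.apply(lambda found: sep.join(n for n in found if int(n) > threshold))

df["C"] = join_numbers_above(df["B"])
print(df["C"].tolist())
```

Note that this assumes every row in B is a string. A missing value would come back from str.findall as NaN and the lambda would fail on it, so such rows would need handling first (for example with fillna('')).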
Answer 1: (score: 1)
Using a single regex pattern (taking into account your comment that you "need the year occurrences"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
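One thing to be aware of: the pattern above has \b only at the start, so a longer digit run such as 20201 would still contribute its first four digits. If that matters for your data, a variant with a boundary on both sides may be safer. The following is only a sketch under that assumption; the trailing \b and the names `YEAR_PAT` and `extract_years` are my additions, not part of the answer above.

```python
import re

# Matches 1951-1999 or 2000-9999 as a whole word.
# The trailing \b is an added assumption: it rejects longer digit runs like 20201.
YEAR_PAT = re.compile(r"\b(19(?:5[1-9]|[6-9]\d)|[2-9]\d{3})\b")

def extract_years(text):
    """Hypothetical helper: join all matched years in `text` with underscores."""
    # With a single capture group, findall returns the captured strings directly.
    return "_".join(YEAR_PAT.findall(text))

for sentence in [
    "In 2015 and 2019 I went on holiday to Spain",
    "When I was 17, in 2014 I got a new car",
    "Order 20201 was placed in 1950",
]:
    print(repr(extract_years(sentence)))
```

The trade-off is that a year followed directly by a letter, as in "the 1960s", no longer matches, so whether the extra boundary helps depends on what the text looks like.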