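Find a substring, then use the number in that string to calculate a new column in pandas

Time: 2018-02-22 17:24:32

Tags: python pandas

I have a data frame with 5 columns scraped from a website. What I want to do is create an additional column based on the contents of the first two columns. For example, the data looks like this: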

Duration                                                               Issues in 1 year
Pay by Annual Recurring Payment                                         51
Pay every 3 months by Recurring Payment                                 51
Pay every 6 months by Recurring Payment                                 51
First 3 issues for £3, then £15 recurring every 6 months thereafter     14
One off payment - Pay for 1 year                                        14
First 6 issues for £10, then £15 recurring every 6 months thereafter     9
One-Off Payment – Pay for 9 issues                                      12
One-Off Payment – Pay for 20 issues                                     51
First year for £29.99, then £20 recurring every 6 months thereafter     13

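I would like an additional column containing the number of months the deal covers, based on the 'Duration' string and (when necessary) using the 'Issues in 1 year' column to calculate the number of months.

I have managed to cover most of what I need by copying Duration into a new column and using 'str.contains':
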
df1['Months'] = df1['Duration']
df1.loc[df1['Months'].str.contains('1 year|annual', case=False), 'Months'] = 12
df1.loc[df1['Months'].str.contains('6 months by', case=False), 'Months'] = 6
df1.loc[df1['Months'].str.contains('3 months by', case=False), 'Months'] = 3

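The above does look a bit clunky and I feel there is probably a smoother solution, but it does work.

For durations with a fixed cost for the first 3 or 6 issues, I am only interested in the number of months covered by the initial payment, so I used: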

df1.loc[df1['Months'].str.contains('first 3', case=False), 'Months'] = round((12 / df1.Issues) * 3,0)

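The above seems to work, but could probably be more efficient.

I am now really stuck on the 'Pay for x issues' type. I need to identify strings with that pattern and then use the number in them to calculate the answer. I tried the same approach as before but with extract, and I get an unexpected keyword argument 'case' error: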
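
One possible way to generalise the line above to any 'First N issues' string is to extract N with a regular expression instead of hard-coding 3. This is only a minimal sketch on a small made-up frame: the column name Issues follows the snippet above, and the pattern assumes the count is always written in digits.

```python
import pandas as pd

# Small illustrative frame; 'Issues' mirrors the column name used in the snippet above.
df1 = pd.DataFrame({
    'Duration': [
        'First 3 issues for £3, then £15 recurring every 6 months thereafter',
        'First 6 issues for £10, then £15 recurring every 6 months thereafter',
        'Pay by Annual Recurring Payment',
    ],
    'Issues': [14, 9, 51],
})

# Pull N out of 'First N issues'; rows without the pattern become NaN.
first_n = pd.to_numeric(
    df1['Duration'].str.extract(r'[Ff]irst (\d+) issues', expand=False)
)

# Months covered by the initial payment: N issues at 12 / Issues months each.
df1['Months'] = (12 / df1['Issues'] * first_n).round()
print(df1[['Issues', 'Months']])
```

Rows that do not match stay NaN here, so they can still be filled by the other rules.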

df1.loc[df1['Months'].str.contains('Pay for (.+?) issues', case=False), 'Months'] = round((12 / df1.Issues) * df1.loc[df1['Months'].str.extract('Pay for (.+?) issues', case=False), 'Months'],0)

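I am not sure whether my regex logic is correct, as I am still getting to grips with it, but I copied it from this post.

To (try and) simplify, this is what I am trying to achieve:

If 'One-Off Payment – Pay for 20 issues' contains '...Pay for x issues...' = 12 / Issues (51) * 20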
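
For context on the error: str.contains accepts a case argument, but str.extract does not; it takes regex flags instead. A minimal sketch of a case-insensitive extract follows, on a hypothetical Series `s`; it also swaps (.+?) for (\d+) on the assumption that only digits are wanted.

```python
import re
import pandas as pd

# Hypothetical sample strings, including one in lower case.
s = pd.Series([
    'One-Off Payment – Pay for 20 issues',
    'one-off payment – pay for 9 issues',
    'Pay by Annual Recurring Payment',
])

# Pass flags=re.IGNORECASE instead of case=False; expand=False returns a Series.
n_issues = s.str.extract(r'pay for (\d+) issues', flags=re.IGNORECASE, expand=False)

# The captured text is a string, so convert it before doing arithmetic.
n_issues = pd.to_numeric(n_issues)
print(n_issues)
```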

Which would give the final result:

Duration                                  Issues in 1 year      Months
One-Off Payment – Pay for 20 issues       51                    5
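
The arithmetic behind that row can be checked by hand; the variable names below are only illustrative.

```python
issues_per_year = 51
issues_paid_for = 20

# Each issue covers 12 / 51 months, roughly 0.235 months.
months_per_issue = 12 / issues_per_year

# 20 issues cover about 4.7 months, which rounds to 5.
months = round(months_per_issue * issues_paid_for)
print(months)
```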

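Also, if there is an easy way to do the above, I assume the same logic could be applied to the 'Pay every x months...' strings.

Any help would be hugely appreciated. I am new to this and have been trying to find an answer for days without success.

1 Answer:

Answer 0: (score: 0)

Assuming the 'Pay for x issues' statement does not contain any other numbers, you can try this.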

import re
import pandas as pd

## sample data frame
df = pd.DataFrame({'Duration':['Pay by Annual Recurring Payment',                                         
'Pay every 3 months by Recurring Payment',                               
'Pay every 6 months by Recurring Payment',                               
'First 3 issues for £3, then £15 recurring every 6 months thereafter',
'One off payment - Pay for 1 year',
'First 6 issues for £10, then £15 recurring every 6 months thereafter',
'One-Off Payment – Pay for 9 issues',                                 
'One-Off Payment – Pay for 20 issues',  
'First year for £29.99, then £20 recurring every 6 months thereafter'], 'Issues_in_1_year' : [51, 51, 51,14,14,9,12,51,13]  })

## extract month and pay value in separate columns (-1 means "not found")
df['Months'] = pd.to_numeric(df['Duration'].str.extract(r'(\d+) months by', expand=False)).fillna(-1).astype(int)
df.loc[df['Duration'].str.contains(r'\d+ year|Annual'), 'Months'] = 12
## require the word 'issues' so that 'Pay for 1 year' is not read as 1 issue
df['Pay_Value'] = pd.to_numeric(df['Duration'].str.extract(r'Pay for (\d+) issues', expand=False)).fillna(-1).astype(int)

## calculate solution
def get_sol(row):
    if row.Pay_Value != -1:
        # 'Pay for x issues': x issues at (12 / issues per year) months each
        return round((12 / row.Issues_in_1_year) * row.Pay_Value)
    elif row.Months != -1:
        return round((12 / row.Issues_in_1_year) * row.Months)
    # neither pattern matched
    return 0

df['solution'] = df.apply(get_sol, axis=1)
print(df)

The output looks like this, where solution is the column we calculated (only a few rows shown):

    Duration                                 Issues_in_1_year   Months  Pay_Value   solution
0   Pay by Annual Recurring Payment                 51           12        -1       3
1   Pay every 3 months by Recurring Payment         51            3        -1       1
2   Pay every 6 months by Recurring Payment         51            6        -1       1
7   One-Off Payment – Pay for 20 issues             51           -1        20       5
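
As a possible variation, the same idea can be written without apply. This sketch is not the code above: it reflects one reading of the question, in which period-based strings ('Annual', '1 year', 'every 3 months') are already a number of months and only issue counts are scaled by 12 / Issues_in_1_year. The shortened sample frame and the use of np.select are choices made for illustration.

```python
import numpy as np
import pandas as pd

# Shortened sample frame, reusing strings and values from the data above.
df = pd.DataFrame({
    'Duration': [
        'Pay by Annual Recurring Payment',
        'Pay every 3 months by Recurring Payment',
        'One off payment - Pay for 1 year',
        'One-Off Payment – Pay for 9 issues',
        'One-Off Payment – Pay for 20 issues',
    ],
    'Issues_in_1_year': [51, 51, 14, 12, 51],
})

dur = df['Duration']
every_n = pd.to_numeric(dur.str.extract(r'(\d+) months by', expand=False))
pay_for = pd.to_numeric(dur.str.extract(r'Pay for (\d+) issues', expand=False))
is_annual = dur.str.contains(r'\d+ year|annual', case=False)

# Issue counts are converted to months; period-based strings are used as they are.
months_from_issues = (12 / df['Issues_in_1_year'] * pay_for).round()

# The first matching condition wins; unmatched rows stay NaN.
df['Months'] = np.select(
    [is_annual, every_n.notna(), pay_for.notna()],
    [12, every_n, months_from_issues],
    default=np.nan,
)
print(df)
```

Leaving unmatched rows as NaN rather than 0 or -1 makes it easier to see which strings still need a rule.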