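I have a DataFrame with 5 columns scraped from a website. What I want to do is create an additional column based on the contents of the first two columns. For example, the data looks like this: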
Duration                                                              Issues in 1 year
Pay by Annual Recurring Payment                                       51
Pay every 3 months by Recurring Payment                               51
Pay every 6 months by Recurring Payment                               51
First 3 issues for £3, then £15 recurring every 6 months thereafter   14
One off payment - Pay for 1 year                                      14
First 6 issues for £10, then £15 recurring every 6 months thereafter  9
One-Off Payment – Pay for 9 issues                                    12
One-Off Payment – Pay for 20 issues                                   51
First year for £29.99, then £20 recurring every 6 months thereafter   13
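I would like an additional column containing the number of months for the deal, based on the 'Duration' string and (when necessary) using the 'Issues in 1 year' column to calculate the number of months.
I have managed to get most of what I need by copying Duration into a new column and using 'str.contains':
df1['Months'] = df1['Duration']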
df1.loc[df1['Months'].str.contains('1 year|annual', case=False), 'Months'] = 12
df1.loc[df1['Months'].str.contains('6 months by', case=False), 'Months'] = 6
df1.loc[df1['Months'].str.contains('3 months by', case=False), 'Months'] = 3
The above does look a little clunky, and I feel there may be a slicker solution, but it does work.
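One possible way to make the keyword matching less repetitive is to keep the patterns in a lookup and loop over them. This is only a sketch under assumptions: the `patterns` dict and the small sample frame are illustrative, and the result is written to a numeric 'Months' column that starts as NaN instead of a copy of 'Duration'.

```python
import pandas as pd

# Small illustrative sample; column names follow the question.
df1 = pd.DataFrame({
    'Duration': [
        'Pay by Annual Recurring Payment',
        'Pay every 3 months by Recurring Payment',
        'Pay every 6 months by Recurring Payment',
        'One off payment - Pay for 1 year',
        'One-Off Payment – Pay for 9 issues',
    ],
})

# Hypothetical pattern -> months lookup (not part of the original code).
patterns = {
    r'1 year|annual': 12,
    r'6 months by': 6,
    r'3 months by': 3,
}

# Start from NaN so rows that match no pattern stay visibly unresolved.
df1['Months'] = float('nan')
for pattern, months in patterns.items():
    mask = df1['Duration'].str.contains(pattern, case=False)
    df1.loc[mask, 'Months'] = months

print(df1)
```

Matching against 'Duration' and writing to a separate column means the text being searched is never overwritten by numbers part-way through.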
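For durations that have a fixed cost for the first 3 or 6 issues, I am only interested in the number of months covered by the initial payment, so I used: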
df1.loc[df1['Months'].str.contains('first 3', case=False), 'Months'] = round((12 / df1.Issues) * 3,0)
The above seems to work, but could probably be more efficient.
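The 'first 3' and 'first 6' cases could perhaps be handled in one step by capturing the number from the string. A minimal sketch, assuming the phrase always reads "First N issues" and that the issues-per-year column is named 'Issues' as in the line above:

```python
import re
import pandas as pd

# Illustrative sample; 'Issues' holds the issues per year.
df1 = pd.DataFrame({
    'Duration': [
        'First 3 issues for £3, then £15 recurring every 6 months thereafter',
        'First 6 issues for £10, then £15 recurring every 6 months thereafter',
        'Pay by Annual Recurring Payment',
    ],
    'Issues': [14, 9, 51],
})

# Capture N from "First N issues"; rows without the phrase become NaN.
first_n = pd.to_numeric(
    df1['Duration'].str.extract(r'first (\d+) issues', flags=re.IGNORECASE, expand=False)
)

# Months covered by the initial payment: (12 / issues per year) * N.
df1['Months'] = (12 / df1['Issues'] * first_n).round()

print(df1[['Issues', 'Months']])
```

Rows that do not contain the phrase keep NaN, so they can be filled in by the other rules.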
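I am now really struggling with the 'Pay for x issues' type. I need to be able to identify strings with that pattern and then use the number in them to calculate the answer. I tried the same approach as before but with extract, and I get an "unexpected keyword argument 'case'" error: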
df1.loc[df1['Months'].str.contains('Pay for (.+?) issues', case=False), 'Months'] = round((12 / df1.Issues) * df1.loc[df1['Months'].str.extract('Pay for (.+?) issues', case=False), 'Months'],0)
I am not sure my regex logic is right, as I am still getting to grips with it, but I copied it from this post.
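The error arises because `str.contains` accepts `case=False` but `str.extract` does not; `str.extract` takes regex `flags` instead. A minimal sketch of the 'Pay for x issues' calculation, assuming the number is always written as digits and using an illustrative sample frame:

```python
import re
import pandas as pd

# Illustrative sample; 'Issues' holds the issues per year.
df1 = pd.DataFrame({
    'Duration': [
        'One-Off Payment – Pay for 9 issues',
        'One-Off Payment – Pay for 20 issues',
        'Pay by Annual Recurring Payment',
    ],
    'Issues': [12, 51, 51],
})

# str.extract has no `case` argument; pass re.IGNORECASE through `flags`.
# expand=False returns a Series for a single capture group.
n_issues = pd.to_numeric(
    df1['Duration'].str.extract(r'pay for (\d+) issues', flags=re.IGNORECASE, expand=False)
)

# Months = (12 / issues per year) * number of issues paid for.
df1['Months'] = (12 / df1['Issues'] * n_issues).round()

print(df1[['Issues', 'Months']])
```

Using `(\d+)` rather than `(.+?)` restricts the capture to digits, so the result can be converted to a number directly.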
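To (try and) simplify, I am trying to achieve:
If 'One-Off Payment – Pay for 20 issues' contains '...Pay for x issues...', then Months = 12 / Issues (51) * 20
Which would give a final result of: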
Duration                             Issues in 1 year  Months
One-Off Payment – Pay for 20 issues  51                5
Also, if there is a simple way to do the above, I assume the same logic could be applied to the 'Pay every x months...' strings.
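As a rough illustration of that idea, the month count can be captured from 'Pay every x months' in the same way. This sketch assumes the phrase always uses digits, and the fallback to 12 for 'annual' strings is an assumption added for the example:

```python
import re
import pandas as pd

durations = pd.Series([
    'Pay every 3 months by Recurring Payment',
    'Pay every 6 months by Recurring Payment',
    'Pay by Annual Recurring Payment',
])

# Capture x from "Pay every x months"; non-matching rows become NaN.
every_x = pd.to_numeric(
    durations.str.extract(r'pay every (\d+) months', flags=re.IGNORECASE, expand=False)
)

# Assumed fallback: treat unmatched 'annual' strings as 12 months.
is_annual = durations.str.contains('annual', case=False)
months = every_x.mask(every_x.isna() & is_annual, 12)

print(months)
```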
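Any help would be hugely appreciated. I am new to this and have been trying to find an answer for days without success.
Answer 0 (score: 0)
Assuming the 'Pay for x issues' statements do not contain any other numbers, you can try this.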
import pandas as pd

# Sample data frame
df = pd.DataFrame({
    'Duration': [
        'Pay by Annual Recurring Payment',
        'Pay every 3 months by Recurring Payment',
        'Pay every 6 months by Recurring Payment',
        'First 3 issues for £3, then £15 recurring every 6 months thereafter',
        'One off payment - Pay for 1 year',
        'First 6 issues for £10, then £15 recurring every 6 months thereafter',
        'One-Off Payment – Pay for 9 issues',
        'One-Off Payment – Pay for 20 issues',
        'First year for £29.99, then £20 recurring every 6 months thereafter',
    ],
    'Issues_in_1_year': [51, 51, 51, 14, 14, 9, 12, 51, 13],
})

# Extract the month count and the paid-issue count into separate columns (-1 means "not found")
df['Months'] = pd.to_numeric(df['Duration'].str.extract(r'(\d+) months by', expand=False)).fillna(-1).astype(int)
# Annual and "N year" deals count as 12 months (no capture groups, so str.contains does not warn)
df.loc[df['Duration'].str.contains(r'\d+ year| \d+ annual | Annual'), 'Months'] = 12
# Require the word "issues" so that 'Pay for 1 year' is not picked up as a paid-issue count
df['Pay_Value'] = pd.to_numeric(df['Duration'].str.extract(r'Pay for (\d+) issues', expand=False)).fillna(-1).astype(int)

# Calculate the solution
def get_sol(row):
    if row.Months == -1 and row.Pay_Value == -1:
        return 0
    elif row.Months != -1 and row.Pay_Value == -1:
        return round((12 / row.Issues_in_1_year) * row.Months)
    elif row.Months == -1 and row.Pay_Value != -1:
        return round((12 / row.Issues_in_1_year) * row.Pay_Value)

df['solution'] = df.apply(get_sol, axis=1)
print(df)
The output looks like this, where solution is the column we calculated (only a few rows shown):
   Duration                                 Issues_in_1_year  Months  Pay_Value  solution
0  Pay by Annual Recurring Payment          51                12      -1         3
1  Pay every 3 months by Recurring Payment  51                3       -1         1
2  Pay every 6 months by Recurring Payment  51                6       -1         1
3  One-Off Payment – Pay for 20 issues      51                -1      20         5
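The row-wise `apply` with `get_sol` could also be written with column operations. The following is a sketch, not the method from the answer: it swaps in `numpy.select`, assumes the same -1 "not found" convention for 'Months' and 'Pay_Value', and uses a small hand-built frame for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame using the same -1 sentinel as the code above.
df = pd.DataFrame({
    'Issues_in_1_year': [51, 51, 51, 13],
    'Months': [12, -1, -1, -1],
    'Pay_Value': [-1, 20, -1, -1],
})

months_per_issue = 12 / df['Issues_in_1_year']
has_months = df['Months'] != -1
has_pay = df['Pay_Value'] != -1

# Same three branches as get_sol; any other combination is left as NaN.
df['solution'] = np.select(
    [has_months & ~has_pay, ~has_months & has_pay, ~has_months & ~has_pay],
    [
        (months_per_issue * df['Months']).round(),
        (months_per_issue * df['Pay_Value']).round(),
        0,
    ],
    default=np.nan,
)

print(df)
```

The conditions are evaluated for whole columns at once, which tends to scale better than calling a Python function per row.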