我有2个数据帧:
df1
+------------+-------------+------+
| Product ID | Cost Method | Rate |
+------------+-------------+------+
| 10 | CPM | 10 |
| 20 | CPC | 0.3 |
| 30 | CPCV | 0.4 |
| 40 | FLF | 100 |
| 50 | VAD | 0 |
| 60 | CPM | 0.1 |
+------------+-------------+------+
df2
+--------+------------+-------------+--------+-----------------+
| Date | Product ID | Impressions | Clicks | Completed Views |
+--------+------------+-------------+--------+-----------------+
| 01-Jan | 10 | 300 | 4 | 0 |
| 02-Jan | 20 | 30 | 3 | 0 |
| 03-Jan | 30 | 200 | 4 | 20 |
| 02-Jan | 40 | 300 | 4 | 0 |
| 02-Jan | 40 | 500 | 4 | 0 |
| 03-Jan | 40 | 200 | 3 | 0 |
| 04-Jan | 90 | 3000 | 3 | 0 |
| 05-Jan | 50 | 3000 | 5 | 0 |
+--------+------------+-------------+--------+-----------------+
理想的输出是:
+--------+------------+-------------+--------+-----------------+--------+
| Date | Product ID | Impressions | Clicks | Completed Views | Spend |
+--------+------------+-------------+--------+-----------------+--------+
| 01-Jan | 10 | 300 | 4 | 0 | $3 |
| 02-Jan | 20 | 30 | 3 | 0 | $1 |
| 03-Jan | 30 | 200 | 4 | 20 | $8 |
| 02-Jan | 40 | 300 | 4 | 0 | $50 |
| 02-Jan | 40 | 500 | 4 | 0 | $50 |
| 03-Jan | 40 | 200 | 3 | 0 | $- |
| 04-Jan | 90 | 3000 | 3 | 0 | $- |
| 05-Jan | 50 | 3000 | 5 | 0 | $- |
+--------+------------+-------------+--------+-----------------+--------+
其中:
答案 0 :(得分:2)
即使我不是真的,我也会回答你。你是Stack Overflow(SO)的新手,所以让这个教育性的帖子。请放心,这篇文章的基调并不是试图屈服或苛刻。
首先,要问一个正确的问题(请this请),你需要做两件事:
对于可行的示例,您有点这样做,但您选择的格式真的很烦人,因为无法直接使用pd.read_clipboard()
来加载数据。这里的人们志愿服务他们的时间,如果他们不得不花5或10分钟重新创建数据,他们可能就是不会这样做。
以下我将如何做到这一点:
这是第一个数据框,使用df1 = pd.read_clipboard(index_col=0)
加载它:
ProductID CostMethod Rate
10 CPM 10.0
20 CPC 0.3
30 CPCV 0.4
40 FLF 100.0
50 VAD 0.0
60 CPM 0.1
以下是第二个数据框,使用df2 = pd.read_clipboard(index_col=0)
加载它:
ProductID Date Impressions Clicks CompletedViews
10 01-Jan 300 4 0
20 02-Jan 30 3 0
30 03-Jan 200 4 20
40 02-Jan 300 4 0
40 02-Jan 500 4 0
40 03-Jan 200 3 0
90 04-Jan 3000 3 0
50 05-Jan 3000 5 0
现在,就做功课而言,这是一个建议的解决方案。我相信你会尝试理解这段代码的作用,而不仅仅是重用它。
第1步:合并两个数据框
我在df2上合并,这非常重要。阅读Merging
上的pandas文档中的更多内容df3 = df2.merge(df1, left_index=True, right_index=True, how='left')
df3
第2步:计算您的支出
我们要编写自定义函数,然后执行dataframe.apply
def calc_spend(row):
"""
Accepts a row of the dataframe (df3.apply(calc_spend, axis=1)),
and computes the spend according to these rules:
* If costMethod is NaN, then zero
* Where FLF is calculated as the sum of total impressions for that product per day,
and if that sums is over a certain minimum limit,
e.g. 600 impressions, then the rate is applied.
If there are two or more entries for the same day,
then the rate is divided equally by the count of times it appears in the same day
* Where, if a product is VAD, then the spend is 0
* Where CPC is calculated as the rate times the number of clicks
* Where CPM is calculated as rate*(impression / 1000)
"""
if row.CostMethod == 'FLF':
# Calc the sum of total impressions for that product
# I'm using boolean indexing to select the rows where both productID and Date
# are the same as the current row
filterdateproductid = (df3.Date == row.Date) & (df3.index == row.name)
total_impressions = df3.ix[filterdateproductid, 'Impressions'].sum()
if total_impressions < 600:
spend = total_impressions
else:
count = df3.ix[filterdateproductid].shape[0]
rate = row.Rate / count # If you use python 2.7 make sure you do "from future import division"
spend = rate * total_impressions / 1000.0
elif row.CostMethod == 'VAD':
spend = 0
elif row.CostMethod == 'CPC':
spend = row.Rate * row.Clicks
elif row.CostMethod == 'CPM':
spend = row.Rate * row.Impressions / 1000.0
else: # Includes the case where the costMethod is Na
spend = 0
return spend
现在我们可以自己应用函数:
df3['Spend'] = df3.apply(calc_spend, axis=1)
df3
你或许会注意到&#34;花费&#34;我的计算结果与你的完全相同,但这是因为你对如何计算它的初始规格并不是那么好。您可以轻松更改calc_spend
功能以符合您的要求。