Question

I am working with the following sample dataset:

       date      name     point
0   4/24/2019   Martha   3617138
1   4/25/2019   Martha   3961918
2   4/26/2019   Martha   4774966
3   4/27/2019   Martha   5217946
4   4/24/2019   Alex     62700321
5   4/25/2019   Alex     66721020
6   4/26/2019   Alex     71745138
7   4/27/2019   Alex     88762943
8   4/28/2019   Alex    102772578
9   4/29/2019   Alex    129089274
10  3/1/2019    Josh     1063259
11  3/3/2019    Josh     1063259
12  3/4/2019    Josh     1063259
13  3/5/2019    Josh     1063259
14  3/6/2019    Josh     1063259

and a list of name values:

nameslist = ['Martha', 'Alex', 'Josh']

I want to calculate the percent change across all rows for each identifier in the name column.

Expected output:

name    percent change
Martha      30.7
Alex        51.4
Josh          0

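Initially, I tried looping over the list and the table, collecting all rows that match the current list value, appending the change value to a list, and then moving on to the next value in the list, but I could not work out how to write the code to do this correctly.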

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')

growthlist=[]
temptable=[]
for i in nameslist:
    for j in df:
        temptable.append(df[df['name'].str.match(nameslist[i])])
        length=[]
        growth=temptable[0]-temptable[length-1]
        growthlist.append(i,growth)

But this produces the error:

TypeError: list indices must be integers or slices, not str

I also don't mind using .groupby() and .pct_change() to achieve this, but

growth = df.groupby('name').pct_change()

produces a long traceback that ends with:

TypeError: unsupported operand type(s) for /: 'str' and 'float'

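Finally, I would like to wrap this in a function so that I can use it on other datasets and choose my column name (the actual datasets I work with are not standardized, so the target column name will usually differ):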

def calc_growth(dataset,colname):

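But I am not sure whether that is too much for one question.

Unfortunately, I am quite lost on this problem, so any help would be appreciated. I also wonder whether transform would be an easier approach, since at least I would always know the exact positions of the two figures I need for the calculation, but I don't even know how to start with something like that.

Thanks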

Answer 1

You can use apply with .values, taking the last and first values, to calculate the percent change over the whole group:

df.groupby('name',sort=False).apply(lambda x: (x['point'].values[-1] - x['point'].values[0]) / x['point'].values[-1] * 100)\
    .reset_index(name='pct change')

     name               pct change
0  Martha  30.67889165583545363347
1    Alex  51.42871358932579539669
2    Josh   0.00000000000000000000

Explanation

First we use groupby on name, which gives us one group (read: one dataframe) for each unique name:

for _, d in df.groupby('name', sort=False):
    print(d, '\n')

        date    name    point
0 2019-04-24  Martha  3617138
1 2019-04-25  Martha  3961918
2 2019-04-26  Martha  4774966
3 2019-04-27  Martha  5217946

        date  name      point
4 2019-04-24  Alex   62700321
5 2019-04-25  Alex   66721020
6 2019-04-26  Alex   71745138
7 2019-04-27  Alex   88762943
8 2019-04-28  Alex  102772578
9 2019-04-29  Alex  129089274

         date  name    point
10 2019-03-01  Josh  1063259
11 2019-03-03  Josh  1063259
12 2019-03-04  Josh  1063259
13 2019-03-05  Josh  1063259
14 2019-03-06  Josh  1063259

Then we apply our own lambda function to each separate group, which performs the following calculation:

percent change = (last value of point - first value of point) / last value of point * 100

Then we use reset_index to move the name column out of the index, since groupby puts it in the index.
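
Since the question also asks for a reusable function, here is a minimal sketch of how the same calculation could be wrapped up. The parameters groupcol and datecol and their defaults are my own assumptions, not something from the question. It divides by the last value, as in the formula above, because that is what reproduces the expected output; a conventional percent change would divide by the first value instead.

```python
import pandas as pd


def calc_growth(dataset, colname, groupcol="name", datecol="date"):
    """Percent change from the first to the last value of `colname` per group.

    `groupcol` and `datecol` are assumed parameter names for illustration.
    The result is divided by the last value, matching the formula above.
    """
    # Work on a copy so the caller's frame is left untouched
    data = dataset.copy()
    data[datecol] = pd.to_datetime(data[datecol])
    # A stable sort keeps rows with equal dates in their original order
    data = data.sort_values(datecol, kind="mergesort")

    grouped = data.groupby(groupcol, sort=False)[colname]
    first = grouped.first()
    last = grouped.last()
    return ((last - first) / last * 100).reset_index(name="pct change")


# Sample data from the question
df = pd.DataFrame({
    "date": ["4/24/2019", "4/25/2019", "4/26/2019", "4/27/2019",
             "4/24/2019", "4/25/2019", "4/26/2019", "4/27/2019",
             "4/28/2019", "4/29/2019",
             "3/1/2019", "3/3/2019", "3/4/2019", "3/5/2019", "3/6/2019"],
    "name": ["Martha"] * 4 + ["Alex"] * 6 + ["Josh"] * 5,
    "point": [3617138, 3961918, 4774966, 5217946,
              62700321, 66721020, 71745138, 88762943,
              102772578, 129089274] + [1063259] * 5,
})

result = calc_growth(df, "point")
print(result.round(1))
```

Note that a group whose last value is 0 would give inf or NaN here, so real data may need an extra check.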

Answer 2

Suppose there is a fourth column, for example a description like the one below:

       date      name     point      descr
0   4/24/2019   Martha   3617138      12g of ecg
1   4/25/2019   Martha   3961918      12g of eg
2   4/26/2019   Martha   4774966      43m of grams
3   4/27/2019   Martha   5217946      13cm of dose
4   4/24/2019   Alex     62700321     32m of grams
5   4/25/2019   Alex     66721020     12g of egc
6   4/26/2019   Alex     71745138      43m of grams
7   4/27/2019   Alex     88762943      30cm of dose
8   4/28/2019   Alex    102772578      12g of egc
9   4/29/2019   Alex    129089274      43m of grams
10  3/1/2019    Josh     1063259       13cm of dose
11  3/3/2019    Josh     1063259       12g of eg
12  3/4/2019    Josh     1063259       12g of eg
13  3/5/2019    Josh     1063259       43m of grams   
14  3/6/2019    Josh     1063259       43m of grams

Could the code be rewritten as

df.groupby('name',sort=False).orderby('descr').apply(lambda x: (x['point'].values[-1] - x['point'].values[0]) / x['point'].values[-1] * 100)\
    .reset_index(name='pct change')\.reset_index(name='descr')

Or what do you think is the right way to incorporate the description column?
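
As far as I know, groupby objects have no orderby method, so the snippet above would fail with an AttributeError. One possible way to carry the description along is to sort the frame first and then use named aggregation. This is only a sketch under my own assumptions: the helper pct_first_to_last, the shortened sample data, and the choice to keep the last descr per name are all illustrative, since it is not stated which description should be kept.

```python
import pandas as pd

# Shortened, hypothetical sample with the extra 'descr' column
df = pd.DataFrame({
    "date": ["4/24/2019", "4/27/2019", "3/1/2019", "3/6/2019"],
    "name": ["Martha", "Martha", "Josh", "Josh"],
    "point": [3617138, 5217946, 1063259, 1063259],
    "descr": ["12g of ecg", "13cm of dose", "13cm of dose", "43m of grams"],
})
df["date"] = pd.to_datetime(df["date"])


def pct_first_to_last(s):
    # Same formula as in the first answer: divide by the last value
    return (s.iloc[-1] - s.iloc[0]) / s.iloc[-1] * 100


out = (
    df.sort_values(["name", "date"])      # order rows before grouping
      .groupby("name", sort=False)
      .agg(pct_change=("point", pct_first_to_last),
           descr=("descr", "last"))       # assumption: keep the latest descr
      .reset_index()
)
print(out)
```

If every description should be kept instead, replacing "last" with something like list or ", ".join would collect them per name.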

Calculate the percent change of a pandas column's values based on another column's value (over time)
