Tidy pandas DataFrame: operating on existing rows and appending the result

Time: 2017-06-01 19:48:40

Tags: pandas

I have a tidy DataFrame called tidy:

>>> import pandas as pd
    import uuid
    import random

    # Five sensors, each read once as type_a and once as type_b
    sensors = [uuid.uuid4() for _ in range(5)]
    tidy = pd.DataFrame({'measure_type': 5*['type_a'] + 5*['type_b'],
                         'sensor': 2*sensors,
                         'value': [random.random() for _ in range(10)]
                         })
>>> tidy
        measure_type    sensor                                  value
    0   type_a          f9726059-1352-49fb-9cc7-cffdf84db325    0.323960
    1   type_a          f3724608-3c28-49c7-a237-09b02a75694b    0.727934
    2   type_a          d59d29ec-32cb-4853-b822-8ac9abec07b9    0.357074
    3   type_a          4a384d86-6288-49f3-be5d-8d54a811b9bd    0.312051
    4   type_a          e59f5497-eb25-4084-816a-297d67768891    0.750661
    5   type_b          f9726059-1352-49fb-9cc7-cffdf84db325    0.424161
    6   type_b          f3724608-3c28-49c7-a237-09b02a75694b    0.608558
    7   type_b          d59d29ec-32cb-4853-b822-8ac9abec07b9    0.759485
    8   type_b          4a384d86-6288-49f3-be5d-8d54a811b9bd    0.095980
    9   type_b          e59f5497-eb25-4084-816a-297d67768891    0.382245

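I want to append 5 rows to this tidy DataFrame: for each sensor, the difference between its type_a reading and its type_b reading. The measurement type of these new rows is type_c.

The way I found to do this is too long and feels wrong. This works: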

>>> df_a = tidy[tidy['measure_type']=='type_a'] # New df with readings for a only
>>> df_a = df_a.set_index('sensor') # Make sensor ID the key
>>> df_a = df_a.drop('measure_type', axis=1) # Keep only the reading value
>>> df_a

sensor                                  value    
f9726059-1352-49fb-9cc7-cffdf84db325    0.323960
f3724608-3c28-49c7-a237-09b02a75694b    0.727934
d59d29ec-32cb-4853-b822-8ac9abec07b9    0.357074
4a384d86-6288-49f3-be5d-8d54a811b9bd    0.312051
e59f5497-eb25-4084-816a-297d67768891    0.750661

Do the same for type_b...

[...]
>>> df_b
sensor                                  value    
f9726059-1352-49fb-9cc7-cffdf84db325    0.424161
f3724608-3c28-49c7-a237-09b02a75694b    0.608558
d59d29ec-32cb-4853-b822-8ac9abec07b9    0.759485
4a384d86-6288-49f3-be5d-8d54a811b9bd    0.095980
e59f5497-eb25-4084-816a-297d67768891    0.382245

Now I can subtract the two:

>>> df_c = df_a - df_b
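
As a side note, this subtraction lines the rows up by the sensor index, not by row position. The sketch below illustrates that with made-up sensor names (s1, s2, s3 are placeholders, not the UUIDs above): a sensor present in only one of the two frames gets NaN.

```python
import pandas as pd

# Hypothetical frames indexed by sensor; the names are placeholders.
a = pd.DataFrame({'value': [0.5, 0.75]},
                 index=pd.Index(['s1', 's2'], name='sensor'))
b = pd.DataFrame({'value': [0.25, 0.5]},
                 index=pd.Index(['s2', 's3'], name='sensor'))

# Rows are matched on the index label: only s2 exists in both frames.
c = a - b
print(c)
```

Only s2 gets a numeric difference here; s1 and s3 come out as NaN.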

I need to add a column for the measurement type:

>>> df_c['measure_type'] = 'type_c'

Only then can I append it, and I have to reset the index:

>>> tidy = pd.concat([tidy, df_c.reset_index()], ignore_index=True)
>>> tidy
    measure_type   sensor                               value
0   type_a         f9726059-1352-49fb-9cc7-cffdf84db325 0.323960
1   type_a         f3724608-3c28-49c7-a237-09b02a75694b 0.727934
2   type_a         d59d29ec-32cb-4853-b822-8ac9abec07b9 0.357074
3   type_a         4a384d86-6288-49f3-be5d-8d54a811b9bd 0.312051
4   type_a         e59f5497-eb25-4084-816a-297d67768891 0.750661
5   type_b         f9726059-1352-49fb-9cc7-cffdf84db325 0.424161
6   type_b         f3724608-3c28-49c7-a237-09b02a75694b 0.608558
7   type_b         d59d29ec-32cb-4853-b822-8ac9abec07b9 0.759485
8   type_b         4a384d86-6288-49f3-be5d-8d54a811b9bd 0.095980
9   type_b         e59f5497-eb25-4084-816a-297d67768891 0.382245
10  type_c         f9726059-1352-49fb-9cc7-cffdf84db325 -0.100200
11  type_c         f3724608-3c28-49c7-a237-09b02a75694b 0.119377
12  type_c         d59d29ec-32cb-4853-b822-8ac9abec07b9 -0.402411
13  type_c         4a384d86-6288-49f3-be5d-8d54a811b9bd 0.216071
14  type_c         e59f5497-eb25-4084-816a-297d67768891 0.368416

This surely can't be the simplest way to do it.

Edit

The method should also work with tidy2 and tidy3, where:

tidy2 = tidy.drop(1)

tidy3 = tidy.copy()  # copy so that tidy itself is left unchanged
tidy3.loc[0, 'sensor']='some other uuid'

i.e. in cases where the sensors do not all have the same keys, or the same number of keys.
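
To make the two edge cases concrete, here is a minimal sketch on a small frame with made-up sensor names (s1, s2, s3 are placeholders, not the UUIDs above). The helper incomplete_sensors is hypothetical and only illustrates which sensors lack one of the two readings; it assumes type_a and type_b are the only measurement types present.

```python
import pandas as pd

# Hypothetical small frame; sensor names s1..s3 are placeholders.
small = pd.DataFrame({
    'measure_type': ['type_a', 'type_a', 'type_a', 'type_b', 'type_b', 'type_b'],
    'sensor': ['s1', 's2', 's3', 's1', 's2', 's3'],
    'value': [0.3, 0.7, 0.4, 0.4, 0.6, 0.8],
})

# Like tidy2: drop one row, so sensor s2 has no type_a reading.
small2 = small.drop(1)

# Like tidy3: relabel one row, so s1 loses its type_a reading
# and a new sensor appears with a type_a reading only.
small3 = small.copy()
small3.loc[0, 'sensor'] = 'some other uuid'

def incomplete_sensors(df):
    """Return the sensors that have fewer than two distinct measure types."""
    counts = df.groupby('sensor')['measure_type'].nunique()
    return sorted(counts[counts < 2].index)

print(incomplete_sensors(small2))
print(incomplete_sensors(small3))
```

For these sensors there is no type_a minus type_b difference to compute, so a solution has to either emit NaN or leave the type_c row out.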

2 answers:

Answer 0 (score: 3)

Edited to return NaN so that the tidy2 and tidy3 conditions are handled:

pd.concat([tidy,
           tidy.groupby(['sensor'])
               .apply(lambda x: x.loc[x.measure_type == 'type_a','value'].max()-x.loc[x.measure_type == 'type_b','value'].min())
               .reset_index().assign(measure_type='type_c')
               .rename(columns={0:'value'})]).replace(0, float('nan'))

Let's try:

pd.concat([tidy,
           tidy.groupby(['sensor'])
               .apply(lambda x: x['value'].iloc[0]-x['value'].iloc[1])
               .reset_index().assign(measure_type='type_c')
               .rename(columns={0:'value'})])

Output:

  measure_type                                sensor     value
0       type_a  3bbbe393-74bc-4c77-b95c-fbaaac64ed3f  0.638573
1       type_a  b9b72088-078a-4dd6-91b5-f9e6643a9d43  0.468320
2       type_a  4f90f177-0ed8-4ff5-b635-f317925aebcc  0.945822
3       type_a  307db09c-6b46-4518-b822-7771ab97fbbe  0.886271
4       type_a  061bf0f3-9870-4426-9327-a9e7d9208923  0.757897
5       type_b  3bbbe393-74bc-4c77-b95c-fbaaac64ed3f  0.922330
6       type_b  b9b72088-078a-4dd6-91b5-f9e6643a9d43  0.711345
7       type_b  4f90f177-0ed8-4ff5-b635-f317925aebcc  0.501771
8       type_b  307db09c-6b46-4518-b822-7771ab97fbbe  0.381833
9       type_b  061bf0f3-9870-4426-9327-a9e7d9208923  0.399346
0       type_c  061bf0f3-9870-4426-9327-a9e7d9208923  0.358551
1       type_c  307db09c-6b46-4518-b822-7771ab97fbbe  0.504438
2       type_c  3bbbe393-74bc-4c77-b95c-fbaaac64ed3f -0.283757
3       type_c  4f90f177-0ed8-4ff5-b635-f317925aebcc  0.444052
4       type_c  b9b72088-078a-4dd6-91b5-f9e6643a9d43 -0.243025
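
Note that in the output above the appended type_c rows restart at index 0. If a continuous index is wanted, one option is to pass ignore_index=True to pd.concat. The sketch below shows this on made-up data (the sensor names s1 and s2 and the helper a_minus_b are hypothetical). It also looks the readings up by label rather than by position, so a missing reading yields NaN.

```python
import pandas as pd

# Hypothetical data; sensor names are placeholders.
tidy_demo = pd.DataFrame({
    'measure_type': ['type_a', 'type_a', 'type_b', 'type_b'],
    'sensor': ['s1', 's2', 's1', 's2'],
    'value': [0.5, 0.25, 0.25, 0.75],
})

def a_minus_b(values):
    """values: one sensor's readings, indexed by measure_type."""
    nan = float('nan')
    return values.get('type_a', nan) - values.get('type_b', nan)

diff = (tidy_demo.set_index('measure_type')
                 .groupby('sensor')['value']
                 .apply(a_minus_b)            # one difference per sensor
                 .reset_index(name='value')
                 .assign(measure_type='type_c'))

# ignore_index=True renumbers the combined rows 0..n-1.
result = pd.concat([tidy_demo, diff], ignore_index=True)
print(result)
```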

Answer 1 (score: 2)

tidy.set_index(['sensor', 'measure_type']).value.unstack() \
    .eval('type_c = type_a - type_b', inplace=False).stack() \
    .sort_index(level='measure_type').swaplevel(0, 1).reset_index(name='value')

   measure_type                                sensor     value
0        type_a  4a384d86-6288-49f3-be5d-8d54a811b9bd  0.312051
1        type_a  d59d29ec-32cb-4853-b822-8ac9abec07b9  0.357074
2        type_a  e59f5497-eb25-4084-816a-297d67768891  0.750661
3        type_a  f3724608-3c28-49c7-a237-09b02a75694b  0.727934
4        type_a  f9726059-1352-49fb-9cc7-cffdf84db325  0.323960
5        type_b  4a384d86-6288-49f3-be5d-8d54a811b9bd  0.095980
6        type_b  d59d29ec-32cb-4853-b822-8ac9abec07b9  0.759485
7        type_b  e59f5497-eb25-4084-816a-297d67768891  0.382245
8        type_b  f3724608-3c28-49c7-a237-09b02a75694b  0.608558
9        type_b  f9726059-1352-49fb-9cc7-cffdf84db325  0.424161
10       type_c  4a384d86-6288-49f3-be5d-8d54a811b9bd  0.216071
11       type_c  d59d29ec-32cb-4853-b822-8ac9abec07b9 -0.402411
12       type_c  e59f5497-eb25-4084-816a-297d67768891  0.368416
13       type_c  f3724608-3c28-49c7-a237-09b02a75694b  0.119376
14       type_c  f9726059-1352-49fb-9cc7-cffdf84db325 -0.100201
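
The chain above goes from long to wide, computes the new column, and goes back to long. The same idea can be sketched step by step with pivot and melt, again on made-up sensor names. Dropping the rows that have no value is an assumption about the desired behaviour for sensors missing a reading, not something the answer states.

```python
import pandas as pd

# Hypothetical data; sensor s2 has no type_b reading.
tidy_demo = pd.DataFrame({
    'measure_type': ['type_a', 'type_a', 'type_b'],
    'sensor': ['s1', 's2', 's1'],
    'value': [0.5, 0.25, 0.25],
})

# 1. Long -> wide: one row per sensor, one column per measure_type.
wide = tidy_demo.pivot(index='sensor', columns='measure_type', values='value')

# 2. Compute the new measurement; NaN where either reading is missing.
wide['type_c'] = wide['type_a'] - wide['type_b']

# 3. Wide -> long again, dropping combinations that have no value.
long = (wide.reset_index()
            .melt(id_vars='sensor', var_name='measure_type', value_name='value')
            .dropna(subset=['value'])
            .reset_index(drop=True)
            [['measure_type', 'sensor', 'value']])
print(long)
```

Keeping the intermediate wide frame in a variable makes it easy to inspect which sensors are missing a reading before reshaping back.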