Restructure DataFrame based on a given key

Date: 2018-06-14 05:03:55

Tags: python-3.x pandas pandas-groupby

I am working with a dataset, and after all the cleaning and restructuring it looks like the following.

import pandas as pd

df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)

CSV URL: https://pastebin.com/raw/nkDHEXQC

          id  year period freq_no sequence  file_date  data_date  field  \
0  abcdefghi  2018      A     001      001   20180605   20180331  05210   
1  abcdefghi  2018      A     001      001   20180605   20180331  05210   
2  abcdefghi  2018      A     001      001   20180605   20180331  05210   
3  abcdefghi  2018      A     001      001   20180605   20180330  05220   
4  abcdefghi  2018      A     001      001   20180605   20180330  05220   
5  abcdefghi  2018      A     001      001   20180605   20180330  05230   
6  abcdefghi  2018      A     001      001   20180605   20180330  05230   

   value note_type            note transaction_type  
0  200.0       NaN             NaN                A  
1    NaN         B   {05210_B:ABC}                A  
2    NaN         U  {05210_U:DEFF}                D  
3  200.0       NaN             NaN                U  
4    NaN         U   {05220_U:xyz}                D  
5  100.0       NaN             NaN                D  
6    NaN         U   {05230_U:lmn}                A 

I want to restructure the above so that it looks like the following.

Logic:

  1. Use id, year, period, freq_no, sequence, data_date as the key (groupby?)
  2. Transpose so that field becomes the columns, with value as the values of those columns
  3. Create combined_note by concatenating note (within the same key)
  4. Create a deleted column that shows the deleted value/note based on transaction_type D

Output:

              id  year period freq_no sequence  file_date  data_date  05210  \
    0  abcdefghi  2018      A     001      001   20180605   20180331  200.0   
    1  abcdefghi  2018      A     001      001   20180605   20180330    NaN   
    
       05220  05230                combined_note              deleted  
    0    NaN    NaN  {05210_B:ABC}{05210_U:DEFF}       {05210_U:DEFF}  # because for note {05210_U:DEFF} the transaction_type was D
    1  200.0  100.0   {05220_U:xyz}{05230_U:lmn}  {05220_U:xyz}|05230  # because for note {05220_U:xyz} the transaction_type is D; we also show the field (05230) here, pipe-separated, because that row's transaction_type is D
    

I think I could use set_index with the key and then restructure the other columns, but I was unable to get the desired output.
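
For reference, the value columns alone (steps 1 and 2) fall out of a pivot_table on that key; a minimal sketch, assuming the df loaded above, which does not handle combined_note or deleted:

    # Sketch: pivot only the field/value pairs onto the key columns.
    # The notes/deleted logic is the part I can't get right.
    key = ['id', 'year', 'period', 'freq_no', 'sequence', 'file_date', 'data_date']
    values_wide = df.pivot_table(index=key, columns='field', values='value').reset_index()
    print(values_wide)  # one row per key, with 05210/05220/05230 as value columns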

1 Answer:

Answer 0 (score: 2)

So in the end I had to use a merge. Logic steps:

  1. Group the DataFrame on all columns except note and value. This keeps the field and transaction_type columns out of the aggregation.
  2. Add the deleted column.
  3. Build a first DataFrame containing the aggregated notes (and deleted as well).
  4. Build a second DataFrame that converts field and value into multiple columns.
  5. Merge the first and second DataFrames on their index.

Code:

    import pandas as pd
    import io
    
    # 'display.height' was removed from newer pandas versions, so it is omitted here
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    # url = "https://pastebin.com/raw/nkDHEXQC"
    csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
    abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
    abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
    abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
    abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
    abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
    abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
    abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
    """
    data = io.BytesIO(csv_string)
    df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
    
    # fill NaN notes with empty strings so the string-concatenating aggregation works
    df['note'] = df['note'].fillna('')
    grouped = df.groupby(
        ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date', 'field', 'transaction_type']).agg(['sum'])
    
    grouped.columns = grouped.columns.droplevel(1)
    grouped.reset_index(['field', 'transaction_type'], inplace=True)
    gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
    
    
    def is_deleted(note, trans_type, field):
        """Return the note text where transaction_type is 'D', else an empty string.

        `field` is unused here; it would be needed for the pipe-separated
        field logic mentioned below.
        """
        deleted = []
        for val, val2 in zip(note, trans_type):
            if val != "" and val2 == 'D':
                deleted.append(val)
            else:
                deleted.append('')
        return pd.Series(deleted, index=note.index)
    
    
    # This adds the deleted notes as a column
    # I am not sure about the pipe operator; I will leave that to you
    grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
    
    # This aggregates (string-sums) the notes and deleted entries per key
    notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
    
    # convert the field/value pairs into one column per field;
    # pivot_table is used to take advantage of the multi-index
    stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
    
    # finally merge the notes and stacked_value on their index
    final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
    

Output:

    final
              id  year period freq_no sequence  data_date  file_date  05210  05220  05230                combined_note         deleted
    0  abcdefghi  2018      A     001      001   20180330   20180605    NaN  200.0  100.0   {05220_U:xyz}{05230_U:lmn}   {05220_U:xyz}
    1  abcdefghi  2018      A     001      001   20180331   20180605  200.0    NaN    NaN  {05210_B:ABC}{05210_U:DEFF}  {05210_U:DEFF}
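
The pipe-separated field part of deleted (the |05230 in the question's expected output) is left open above. A hypothetical follow-up, reusing grouped, gcolumns, and final from the code, could look like this (a sketch, not part of the original answer):

    # Hypothetical extension: a transaction_type 'D' row contributes its note
    # text if it has one, otherwise its field code; the parts are then joined
    # per key with '|', matching the question's expected deleted column.
    def deleted_part(row):
        if row['transaction_type'] != 'D':
            return ''
        return row['note'] if row['note'] else row['field']

    parts = grouped.apply(deleted_part, axis=1)
    deleted = (parts.groupby(level=gcolumns)
                    .agg(lambda s: '|'.join(p for p in s if p))
                    .rename('deleted'))

    final = (final.drop(columns='deleted')
                  .merge(deleted.reset_index(), on=gcolumns))
    print(final[['data_date', 'combined_note', 'deleted']])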