Restructure DataFrame based on a given key

Date: 2018-06-14 05:03:55

Tags: python-3.x pandas pandas-groupby

I am working with a dataset, and after all the cleaning and restructuring it looks like the following.

import pandas as pd

df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)

CSV URL: https://pastebin.com/raw/nkDHEXQC

          id  year period freq_no sequence  file_date  data_date  field  \
0  abcdefghi  2018      A     001      001   20180605   20180331  05210   
1  abcdefghi  2018      A     001      001   20180605   20180331  05210   
2  abcdefghi  2018      A     001      001   20180605   20180331  05210   
3  abcdefghi  2018      A     001      001   20180605   20180330  05220   
4  abcdefghi  2018      A     001      001   20180605   20180330  05220   
5  abcdefghi  2018      A     001      001   20180605   20180330  05230   
6  abcdefghi  2018      A     001      001   20180605   20180330  05230   

   value note_type            note transaction_type  
0  200.0       NaN             NaN                A  
1    NaN         B   {05210_B:ABC}                A  
2    NaN         U  {05210_U:DEFF}                D  
3  200.0       NaN             NaN                U  
4    NaN         U   {05220_U:xyz}                D  
5  100.0       NaN             NaN                D  
6    NaN         U   {05230_U:lmn}                A 

I want to restructure the above so that it looks like the following.

Logic:

  1. Use id, year, period, freq_no, sequence, data_date as the key (groupby?)
  2. Transpose so that field becomes the columns, with value as the values of those columns
  3. Create combined_note by concatenating note (within the same key)
  4. Create a deleted column that shows the deleted value/note based on transaction_type D

Output:

              id  year period freq_no sequence  file_date  data_date  05210  \
    0  abcdefghi  2018      A     001      001   20180605   20180331  200.0   
    1  abcdefghi  2018      A     001      001   20180605   20180330    NaN   
    
       05220  05230                combined_note              deleted  
    0    NaN    NaN  {05210_B:ABC}{05210_U:DEFF}       {05210_U:DEFF}  # because for note {05210_U:DEFF} the transaction_type was D
    1  200.0  100.0   {05220_U:xyz}{05230_U:lmn}  {05220_U:xyz}|05230  # because for note {05220_U:xyz} the transaction_type is D; we also show the field (05230) here, pipe-separated, because that row's transaction_type is D
    

I think I could use set_index with the key and then restructure the other columns, but I was unable to get the desired output.
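
For reference, the value columns alone (steps 1 and 2) fall out of a pivot_table on that key; a minimal sketch, assuming the df loaded above, which does not handle combined_note or deleted:

    # Sketch: pivot only the field/value pairs onto the key columns.
    # The notes/deleted logic is the part I can't get right.
    key = ['id', 'year', 'period', 'freq_no', 'sequence', 'file_date', 'data_date']
    values_wide = df.pivot_table(index=key, columns='field', values='value').reset_index()
    print(values_wide)  # one row per key, with 05210/05220/05230 as value columns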

1 Answer:

Answer 0 (score: 2)

So in the end I had to use a merge. Logic steps:

  1. Group the DataFrame on all columns except note and value. This keeps the field and transaction_type columns out of the aggregation.
  2. Add the deleted column.
  3. Build a first DataFrame containing the aggregated notes (and deleted as well).
  4. Build a second DataFrame that converts field and value into multiple columns.
  5. Merge the first and second DataFrames on their index.

Code:

    import pandas as pd
    import io
    
    # 'display.height' was removed from newer pandas versions, so it is omitted here
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    # url = "https://pastebin.com/raw/nkDHEXQC"
    csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
    abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
    abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
    abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
    abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
    abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
    abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
    abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
    """
    data = io.BytesIO(csv_string)
    df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
    
    # fill NaN notes with empty strings so the string-concatenating aggregation works
    df['note'] = df['note'].fillna('')
    grouped = df.groupby(
        ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date', 'field', 'transaction_type']).agg(['sum'])
    
    grouped.columns = grouped.columns.droplevel(1)
    grouped.reset_index(['field', 'transaction_type'], inplace=True)
    gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
    
    
    def is_deleted(note, trans_type, field):
        """Return the note text where transaction_type is 'D', else an empty string.

        `field` is unused here; it would be needed for the pipe-separated
        field logic mentioned below.
        """
        deleted = []
        for val, val2 in zip(note, trans_type):
            if val != "" and val2 == 'D':
                deleted.append(val)
            else:
                deleted.append('')
        return pd.Series(deleted, index=note.index)
    
    
    # This adds the deleted notes as a column
    # I am not sure about the pipe operator; I will leave that to you
    grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
    
    # This aggregates (string-sums) the notes and deleted entries per key
    notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
    
    # convert the field/value pairs into one column per field;
    # pivot_table is used to take advantage of the multi-index
    stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
    
    # finally merge the notes and stacked_value on their index
    final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
    

Output:

    final
              id  year period freq_no sequence  data_date  file_date  05210  05220  05230                combined_note         deleted
    0  abcdefghi  2018      A     001      001   20180330   20180605    NaN  200.0  100.0   {05220_U:xyz}{05230_U:lmn}   {05220_U:xyz}
    1  abcdefghi  2018      A     001      001   20180331   20180605  200.0    NaN    NaN  {05210_B:ABC}{05210_U:DEFF}  {05210_U:DEFF}
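
The pipe-separated field part of deleted (the |05230 in the question's expected output) is left open above. A hypothetical follow-up, reusing grouped, gcolumns, and final from the code, could look like this (a sketch, not part of the original answer):

    # Hypothetical extension: a transaction_type 'D' row contributes its note
    # text if it has one, otherwise its field code; the parts are then joined
    # per key with '|', matching the question's expected deleted column.
    def deleted_part(row):
        if row['transaction_type'] != 'D':
            return ''
        return row['note'] if row['note'] else row['field']

    parts = grouped.apply(deleted_part, axis=1)
    deleted = (parts.groupby(level=gcolumns)
                    .agg(lambda s: '|'.join(p for p in s if p))
                    .rename('deleted'))

    final = (final.drop(columns='deleted')
                  .merge(deleted.reset_index(), on=gcolumns))
    print(final[['data_date', 'combined_note', 'deleted']])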