I am working with a dataset, and after all the cleaning and restructuring I have ended up with data that looks like the following.
import pandas as pd
df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)
CSV URL: https://pastebin.com/raw/nkDHEXQC
id year period freq_no sequence file_date data_date field \
0 abcdefghi 2018 A 001 001 20180605 20180331 05210
1 abcdefghi 2018 A 001 001 20180605 20180331 05210
2 abcdefghi 2018 A 001 001 20180605 20180331 05210
3 abcdefghi 2018 A 001 001 20180605 20180330 05220
4 abcdefghi 2018 A 001 001 20180605 20180330 05220
5 abcdefghi 2018 A 001 001 20180605 20180330 05230
6 abcdefghi 2018 A 001 001 20180605 20180330 05230
value note_type note transaction_type
0 200.0 NaN NaN A
1 NaN B {05210_B:ABC} A
2 NaN U {05210_U:DEFF} D
3 200.0 NaN NaN U
4 NaN U {05220_U:xyz} D
5 100.0 NaN NaN D
6 NaN U {05230_U:lmn} A
I want to restructure the above so that it looks like the output below.
Logic:
- id, year, period, freq_no, sequence, data_date act as the key (groupby?)
- field becomes columns, and the values of those columns come from value
- combined_note is the note column concatenated over the same key
- deleted is a column that, based on note, shows the value or note that was deleted, i.e. where transaction_type is D

Output:
id year period freq_no sequence file_date data_date 05210 \
0 abcdefghi 2018 A 001 001 20180605 20180331 200.0
1 abcdefghi 2018 A 001 001 20180605 20180330 NaN
05220 05230 combined_note deleted
0 NaN NaN {05210_B:ABC}{05210_U:DEFF} note{05210_U:DEFF} #because for note 05210_U:DEFF the trans_type was D
1 200.0 100.0 {05220_U:xyz}{05230_U:lmn} note{05220_U:xyz}|05230 #because for note {05220_U:xyz} trans_type is D, we also show field (05230) here separated by pipe because for that row the trans_type is D
I think I can use the key with set_index and then restructure the other columns (roughly the sketch below), but I cannot get the desired output.
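For reference, a minimal sketch of the set_index/unstack direction I was attempting, assuming the df loaded by the read_csv above (values_only is just a name I picked); it only pivots value into field columns and does not build combined_note or deleted:

key = ['id', 'year', 'period', 'freq_no', 'sequence', 'file_date', 'data_date']

# keep only the rows that actually carry a value, then pivot field into columns
values_only = (df.dropna(subset=['value'])
                 .set_index(key + ['field'])['value']
                 .unstack('field')
                 .reset_index())
print(values_only)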
Answer 0: (score: 2)
So in the end I had to do a merge; the logic steps are commented in the code.
Code:
import pandas as pd
import io
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# url = "https://pastebin.com/raw/nkDHEXQC"
csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
"""
data = io.BytesIO(csv_string)
df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
# fill NaN notes with an empty string so the 'sum' aggregation below can concatenate them
df['note'] = df['note'].fillna('')
# within each group, 'sum' adds the numeric values and concatenates the note strings
grouped = df.groupby(
    ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date', 'field', 'transaction_type']).agg(['sum'])
grouped.columns = grouped.columns.droplevel(1)
grouped.reset_index(['field', 'transaction_type'], inplace=True)
gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
def is_deleted(note, trans_type, field):
    """Determines if a note is deleted"""
    deleted = []
    for val, val2 in zip(note, trans_type):
        if val != "":
            if val2 == 'D':
                deleted.append(val)
            else:
                deleted.append('')
        else:
            deleted.append('')
    return pd.Series(deleted, index=note.index)
# This function will add the deleted notes
# I am not sure of the pipe operator; I will leave that to you (a rough sketch follows after the output)
grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
# This will obtain all agg of all the notes and deleted
notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
# converts two columns into new columns using specified table
# using pivot table to take advantage of the multi index
stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
# finally merge the notes and stacked_value on their index
final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
Output:
final
id year period freq_no sequence data_date file_date 05210 05220 05230 combined_note deleted
0 abcdefghi 2018 A 001 001 20180330 20180605 NaN 200.0 100.0 {05220_U:xyz}{05230_U:lmn} {05220_U:xyz}
1 abcdefghi 2018 A 001 001 20180331 20180605 200.0 NaN NaN {05210_B:ABC}{05210_U:DEFF} {05210_U:DEFF}
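For the pipe-separated field part left open above, a rough sketch of one way it could be done, reusing grouped and gcolumns from the code (the helper name deleted_with_field is mine, not part of the original answer, and it is only lightly checked against the sample data):

def deleted_with_field(group):
    """Concatenate the deleted notes and append '|<field>' for value rows whose transaction_type is 'D'."""
    out = ''.join(group['deleted'])
    # value-carrying rows with transaction_type 'D' contribute their field code, pipe-separated
    value_deleted = group.loc[(group['transaction_type'] == 'D') & group['value'].notna(), 'field']
    if not value_deleted.empty:
        out = out + '|' + '|'.join(value_deleted)
    return out

deleted_col = grouped.reset_index().groupby(gcolumns).apply(deleted_with_field)
# swap the simple 'deleted' column in `final` for the pipe-aware version
final = final.drop(columns='deleted').merge(deleted_col.to_frame('deleted'), left_on=gcolumns, right_index=True)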