I have a function that works successfully to extract data from inside a dataframe when it looks like this:
company                          created_at            notes
{'id': 'eb904b4b', 'name': 'B'}  2018-06-04T13:57:02Z  Digging Holes
{'id': 'da2dc806', 'name': 'K'}  2018-06-04T13:57:02Z  Drinking Tea
{'id': 'eb904b4b', 'name': 'B'}  2018-05-11T08:52:23Z  Cbales
{'id': '3d38dcb7', 'name': 'B'}  2018-05-11T08:52:23Z  Digg
To give this:
company_id  company_name  created_at            notes
eb904b4b    B             2018-06-04T13:57:02Z  Digging Holes
da2dc806    K             2018-06-04T13:57:02Z  Drinking Tea
eb904b4b    B             2018-05-11T08:52:23Z  Cbales
3d38dcb7    B             2018-05-11T08:52:23Z  Digg
However, if the company column has a blank value, the function fails because it expects a value in every row. I can't figure out how to make my code skip the blanks and carry on.
E.g:
company                          created_at            notes
{'id': 'eb904b4b', 'name': 'B'}  2018-06-04T13:57:02Z  Digging Holes
                                 2018-06-04T13:57:02Z  Drinking Tea
{'id': 'eb904b4b', 'name': 'B'}  2018-05-11T08:52:23Z  Cbales
{'id': '3d38dcb7', 'name': 'B'}  2018-05-11T08:52:23Z  Digg
The code that works for the full frame is as follows:
def shallow_extract(column, df_parent):
    temp_frame = pandas.DataFrame(x for x in df_parent[column])
    temp_frame.columns = [f"{column}_{str(col)}" for col in temp_frame.columns]
    return pandas.concat([df_parent.drop([column], axis=1), temp_frame.apply(pandas.Series)], axis=1)
EDIT: Dropping the rows is not an option, as the other fields can contain data that is needed. The code also needs to accept multiple dataframes in which the column to extract has a different name and position (the column name is passed as the first parameter of the function).
Answer 0 (score: 1)
You can greatly simplify how you expand your dictionary by using df.company.apply(pd.Series). However, if you have empty strings in your DataFrame, this method will create an extra column, labelled 0, that is NaN for every row holding a dictionary. You need to drop that column.
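To make the extra column concrete, here is a minimal sketch on a hypothetical two-element Series (the variable names s, expanded and cleaned are only for illustration and are not part of the setup that follows). It assumes a pandas version in which Series.apply still expands Series results into a DataFrame.

```python
import pandas as pd

# Hypothetical sample: one dictionary and one blank string.
s = pd.Series([{'id': 'eb904b4b', 'name': 'B'}, ''])

# Each element is turned into a Series; the blank string becomes a
# one-element Series whose only label is 0, so a column 0 appears.
expanded = s.apply(pd.Series)
print(list(expanded.columns))

# errors='ignore' keeps this safe when no column 0 was created.
cleaned = expanded.drop(columns=[0], errors='ignore')
print(cleaned)
```

After the drop, the blank row is left with NaN in both id and name, which is the behaviour the function further down relies on.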
Setup:
import pandas as pd

df = pd.DataFrame({
    'company': [{'id': 'eb904b4b', 'name': 'B'},
                {'id': 'da2dc806', 'name': 'K'},
                {'id': 'eb904b4b', 'name': 'B'},
                {'id': '3d38dcb7', 'name': 'B'},
                ''],  # blank entry
    'created_at': ['2018-06-04T13:57:02Z',
                   '2018-06-04T13:57:02Z',
                   '2018-05-11T08:52:23Z',
                   '2018-05-11T08:52:23Z',
                   '2018-05-11T08:52:23Z'],
    'notes': ['Diggin holes', 'Drinking Tea', 'Cbales', 'Digg', 'Other'],
})
You can use this helpful function to do what you want (I used errors='ignore' in case no blank columns exist when you expand):
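def explode_deplode(column, df):
    # Expand each dict into columns; blank strings land in a column labelled 0, which is dropped
    expanded = df[column].apply(pd.Series).drop(columns=[0], errors='ignore')
    # Prefix the new columns with the source column name, join them on, then drop the original
    return df.join(expanded.add_prefix('{}_'.format(column))).drop(columns=[column])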
In action:
explode_deplode('company', df)
   created_at            notes         company_id  company_name
0  2018-06-04T13:57:02Z  Diggin holes  eb904b4b    B
1  2018-06-04T13:57:02Z  Drinking Tea  da2dc806    K
2  2018-05-11T08:52:23Z  Cbales        eb904b4b    B
3  2018-05-11T08:52:23Z  Digg          3d38dcb7    B
4  2018-05-11T08:52:23Z  Other         NaN         NaN
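If you would rather stay close to the original shallow_extract, another possible approach is to replace every non-dictionary value with an empty dict before building the temporary frame. The sketch below is only an illustration under that assumption: the name shallow_extract_safe and the small sample frame are made up here, and it treats anything that is not a dict (empty string, None, NaN) as a blank.

```python
import pandas as pd


def shallow_extract_safe(column, df_parent):
    """Expand a column of dicts into prefixed columns, tolerating blanks."""
    # Anything that is not a dict is treated as "no data" for that row.
    records = [v if isinstance(v, dict) else {} for v in df_parent[column]]
    # Reuse the parent's index so the concat below lines up row by row.
    expanded = pd.DataFrame(records, index=df_parent.index)
    expanded = expanded.add_prefix(f"{column}_")
    return pd.concat([df_parent.drop(columns=[column]), expanded], axis=1)


# Hypothetical sample frame with a blank in the middle row.
sample = pd.DataFrame({
    'company': [{'id': 'eb904b4b', 'name': 'B'}, '', {'id': '3d38dcb7', 'name': 'B'}],
    'notes': ['Digging Holes', 'Drinking Tea', 'Digg'],
})
result = shallow_extract_safe('company', sample)
print(result)
```

Because the column name is a parameter, the same function can be called on other frames whose dictionary column has a different name or position. Rows with a blank keep their other fields and get NaN in the extracted columns.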