Question

I have a function that works successfully to extract data from inside a dataframe when it looks like this:

company                            created_at              notes
{'id': 'eb904b4b', 'name': 'B'}    2018-06-04T13:57:02Z    Digging Holes
{'id': 'da2dc806', 'name': 'K'}    2018-06-04T13:57:02Z    Drinking Tea
{'id': 'eb904b4b', 'name': 'B'}    2018-05-11T08:52:23Z    Cbales
{'id': '3d38dcb7', 'name': 'B'}    2018-05-11T08:52:23Z    Digg

To give this:

company_id  company_name    created_at              notes
eb904b4b    B               2018-06-04T13:57:02Z    Digging Holes
da2dc806    K               2018-06-04T13:57:02Z    Drinking Tea
eb904b4b    B               2018-05-11T08:52:23Z    Cbales
3d38dcb7    B               2018-05-11T08:52:23Z    Digg

However, if the company column has a blank value, the function fails because it expects a value. I can't figure out how to make my code skip the blanks and carry on.

E.g:

company                            created_at              notes
{'id': 'eb904b4b', 'name': 'B'}    2018-06-04T13:57:02Z    Digging Holes
                                   2018-06-04T13:57:02Z    Drinking Tea
{'id': 'eb904b4b', 'name': 'B'}    2018-05-11T08:52:23Z    Cbales
{'id': '3d38dcb7', 'name': 'B'}    2018-05-11T08:52:23Z    Digg

The code that works for the full frame is as follows:

import pandas

def shallow_extract(column, df_parent):
    # Build one row per dict; this is the step that breaks on blank values
    temp_frame = pandas.DataFrame(x for x in df_parent[column])
    temp_frame.columns = [f"{column}_{str(col)}" for col in temp_frame.columns]
    return pandas.concat([df_parent.drop([column], axis=1), temp_frame.apply(pandas.Series)], axis=1)

EDIT: Dropping the rows is not an option, as the other fields can contain data that is needed. The code also needs to accept multiple dataframes with differing positions and names of columns to extract (as given by the parameters of the function).

Answer 1

You can greatly simplify how you expand the dictionaries by using df.company.apply(pd.Series). However, if the column contains empty strings, this method also creates an extra, mostly NaN column labelled 0 that you need to drop.
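To see where that extra column comes from, here is a minimal sketch on a two-row Series (the variable names are made up for illustration). It assumes the blank entries are empty strings, as in the setup used in this answer:

```python
import pandas as pd

# One dict row and one blank row, mimicking the 'company' column
s = pd.Series([{'id': 'eb904b4b', 'name': 'B'}, ''], name='company')

# pd.Series('') is a one-element Series with index label 0, so the
# blank row contributes a column labelled 0 next to 'id' and 'name'
expanded = s.apply(pd.Series)
print(expanded.columns.tolist())

# Dropping that label leaves only the real keys; errors='ignore' makes
# this a no-op when no blank rows were present
cleaned = expanded.drop(columns=0, errors='ignore')
print(cleaned.columns.tolist())
```

The blank row keeps NaN in the 'id' and 'name' columns, which is what you want if the row itself must stay.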

Setup:

import pandas as pd

df = pd.DataFrame({'company': [{'id': 'eb904b4b', 'name': 'B'},
  {'id': 'da2dc806', 'name': 'K'},
  {'id': 'eb904b4b', 'name': 'B'},
  {'id': '3d38dcb7', 'name': 'B'},
  ''],
 'created_at': ['2018-06-04T13:57:02Z',
  '2018-06-04T13:57:02Z',
  '2018-05-11T08:52:23Z',
  '2018-05-11T08:52:23Z',
  '2018-05-11T08:52:23Z'],
 'notes': ['Diggin holes', 'Drinking Tea', 'Cbales', 'Digg', 'Other']})

The following helper does what you want (errors='ignore' keeps the drop from failing when the expansion produces no blank column):

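def explode_deplode(column, df):
    # Expand each dict into its own columns; blank entries end up in a column labelled 0
    expanded = df[column].apply(pd.Series).drop(columns=0, errors='ignore')
    # Prefix the new columns with the source column name, then replace the original column
    return df.join(expanded.add_prefix('{}_'.format(column))).drop(columns=column)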

In action:

explode_deplode('company', df)

             created_at         notes company_id company_name
0  2018-06-04T13:57:02Z  Diggin holes   eb904b4b            B
1  2018-06-04T13:57:02Z  Drinking Tea   da2dc806            K
2  2018-05-11T08:52:23Z        Cbales   eb904b4b            B
3  2018-05-11T08:52:23Z          Digg   3d38dcb7            B
4  2018-05-11T08:52:23Z         Other        NaN          NaN
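
If apply(pd.Series) turns out to be slow on a large frame, one possible alternative is to build the expanded frame directly from a list of dicts, substituting an empty dict for every blank. This is only a sketch: the name expand_dict_column is made up, and it assumes that anything which is not a dict (empty string, None, NaN) should count as blank.

```python
import pandas as pd

def expand_dict_column(column, df):
    # Assumption: any non-dict entry is a blank and becomes an all-NaN row
    records = [v if isinstance(v, dict) else {} for v in df[column]]
    # Reuse the parent index so the join below lines up row by row
    expanded = pd.DataFrame(records, index=df.index).add_prefix(f"{column}_")
    return df.drop(columns=column).join(expanded)

df = pd.DataFrame({
    'company': [{'id': 'eb904b4b', 'name': 'B'}, '', {'id': '3d38dcb7', 'name': 'B'}],
    'created_at': ['2018-06-04T13:57:02Z', '2018-06-04T13:57:02Z', '2018-05-11T08:52:23Z'],
    'notes': ['Digging Holes', 'Drinking Tea', 'Digg'],
})

result = expand_dict_column('company', df)
print(result)
```

As with explode_deplode, no rows are dropped: the blank row keeps its created_at and notes values and gets NaN in the new company_ columns.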

